Elements of Statistical Method 





Albert E. Waugh 

Professor of Economics 
University of Connecticut 


THIRD EDITION 


New York Toronto London 
McGRAW-HILL BOOK COMPANY, INC. 


1952 



Copyright, 1938, 1943, 1952, by the McGraw-Hill Book Company, Inc.
Printed in the United States of America. All rights reserved. This book, or 
parts thereof, may not be reproduced in any form without permission of the 
publishers. 


Library of Congress Catalog Card Number: 51-13600 



This book is affectionately 
dedicated to 

Alexander E. Cance 
and 

Irving G. Davis 

inspired teachers, tireless seekers after 
new things, gentlemen 




PREFACE 


It is now fourteen years since the first edition of this book 
appeared. I was gratified at the time by the reception accorded 
it by the reviewers, and I have been pleased ever since to find a 
continuing demand, both for the original and for the second edition,
which indicates that my colleagues have found the book
useful. I am deeply indebted to those who have offered suggestions
or criticisms. Particularly am I indebted to Professor
Roy W. Jastram of Stanford University and to Professor Frederick
E. Croxton of Columbia University who read the entire
manuscript of earlier editions and gave invaluable suggestions 
for improvement. 

Those familiar with the book in earlier editions will discover 
that the major revisions are in the last half of the book. A 
chapter has been added introducing the student to the ideas of 
analysis of variance. Chapters on curve fitting have been 
entirely reworked and made introductory to the chapters on 
correlation. The elementary work on graphics and the collection
of data which appeared in earlier editions have been omitted.

In the main, however, the purpose and method of the book 
remain unchanged. As the first edition announced, “This book
is planned for the beginner in the field who has yet to learn ‘what
it is all about.’ No attempt has been made to treat any aspect
of the field exhaustively, and advanced students will find it 
necessary to consult other books and, particularly, to acquaint 
themselves with articles in the current technical statistical 
journals. The aim of this book is to introduce the student to 
statistical concepts and statistical nomenclature and to get him 
to think in statistical terms.”

The great difficulty in writing any textbook is to know what to
omit: to keep in mind about how large a body of material a beginning
student can be reasonably expected to assimilate in a semester.
In many cases where the limitations of space and the
student's time make it impossible to cover a subject in detail (in
advanced correlation analysis, for example) a special effort has
been made to cover the basic ideas in such a way that the thoughtful
student will find it easy to continue under his own power.
Every effort has been made to keep the book on the beginner’s 
level. It has been written primarily to help the beginner, and not 
to impress the mature statistician. 


Albert E. Waugh 

Storks, Conn. 

February, 1952



CHAPTER I


THE NATURE OF STATISTICS 

1.1. Scientific Method. —Men have discovered facts in many 
ways. Some things have been learned by chance. Wisdom, so 
it is said, has been imparted to men in dreams and through 
miraculous revelation. There are those who assert that they are 
gifted with clairvoyance, by means of which they are able to 
discern things hidden from the ordinary mortal. And, of course, 
a large portion of all human knowledge has come down to us from 
unknown sources. The method by which it was originally discovered
is not known.

The tremendous advances in human knowledge that have 
characterized the past century and a half have not come, in the 
main, from any of the sources above mentioned. Nowadays a 
great preponderance of the additions to our information are the 
result of plan and not the product of chance. We learn new 
things because we have gone about the learning process methodically,
and not haphazardly. The methods which are used in
acquiring knowledge are called scientific methods, and it should
be remembered that they do involve the following of a definite 
plan. 

1.2. Experimental Method. —The best known of the scientific 
methods, and the one which has been most fruitful, is called the 
experimental method. Galileo, we are told, was attracted by the 
swinging of lamps in the cathedral. He noted that the period 
of the swing varied, and he wished to determine what factors 
influenced the length of the period. He did not rely on dreams, 
nor did he, so far as we know, patronize the local fortuneteller. 
He began to experiment. He went at the problem methodically, 
in an attempt to determine what forces were at work in the 
pendulum. 

Now Galileo might have taken the first half-dozen pendulums 
that he encountered and studied them. If he had done so 
he would have found that the pendulums differed in many
respects. In some the bob would be heavier than in others.
In some the length would be greater than in others. In some 
cases there might be air currents which were not present in 
others. At the points where the pendulums were located there 
might be differences in air pressure, relative humidity, the 
attraction of gravity, etc. And under such circumstances when 
Galileo found that pendulum 1 oscillated more rapidly than 
pendulum 2 he would be unable to tell whether the difference 
was due to differences in length, weight, humidity, air pressure, 
or some combination of these forces. Hence Galileo was very 
careful to construct pendulums that differed in but one respect; 
that is, he would make several pendulums of the same length;
he would protect them all from currents of air that might affect 
the rate of swinging; he would operate them all at the same time 
and place so that there would be no differences in barometric 
pressure and the like. In fact, these pendulums would be exactly 
the same and would be operated under identical conditions, 
except, let us say, that there would be variations in the weight of 
the bob. Under these circumstances if Galileo found that there 
were regular variations in the period of oscillation which corresponded
to differences in the weight of the bob, he would know
that the cause must be bob weight, since no other factors differed. 
If, on the other hand, these pendulums did not differ in period, 
he would know that bob weight did not affect the rate of swinging. 

Having discovered the effect, if any, which was associated 
with the weight of the bob, Galileo would now construct pendulums
which differed in nothing except length, and would
ascertain the relationship of length to period. He would then, 
in turn, discover the effect of gravitational attraction, barometric 
pressure, etc. The important thing to note is that the method 
consists in keeping all forces save one constant, and in varying 
that force in order that the scientist may discover its effect, if 
any. This method of investigation is called the experimental 
method, and where it is applicable scientists prefer it to all other 
methods. 

1.3. Statistical Method. —But often men wish to discover facts 
in fields where the experimental method cannot be applied. Let 
us suppose, for example, that you wish to discover the forces 
that determine the price of milk in New York City. You 
would like to apply the experimental method. This means that
you would have to try one thing at a time, keeping all other
things constant, and note the effect of the changes. First 
you would make changes in the quantity of milk offered on the 
market, to determine whether or not the price varied as you 
varied the amount offered. But it would be necessary for you 
to see that there were no changes in the other factors. You 
would have to establish entire uniformity in people’s wages, since 
presumably the amount that they will pay for milk depends 
partly on the amount of their incomes; you would have to make 
sure that the tastes of consumers remained constant, since 
changes in their desires would perhaps change the amount that 
they would pay; you would find that you were forced to fix the 
general price level so that there would be no changes from variations
in the purchasing power of money; etc. But this is manifestly
out of the question. Here the experimental method cannot
be applied, because the many factors cannot be held constant 
as the scientist varies the one force in which he is interested 
for the moment. Thus in the social and biological sciences 
it is often impossible to make use of the experimental method. 

We should be foolish, however, to neglect entirely those 
fields in which experimental method is out of the question. 
To be sure, it is very difficult to discover facts in such fields, 
but it is none the less desirable that facts be discovered, and 
the scientist must do what he can in the face of difficulties. He 
could, of course, fall back on chance or revelation. In such a 
case he would have left the field of science entirely, since he 
would be following no plan. As a matter of fact he is likely to 
fall back on another method, as a poor second choice. This 
other method (or body of methods) we call statistical method.
When we apply statistical method to a problem we go at the 
problem systematically, as in the experimental method; but the 
system used is not the same. Being unable to hold forces 
constant, we perforce let them vary. But now we record the 
variations in all the forces operating and attempt to determine 
the separate part which each plays in influencing the result. 
Under ordinary circumstances this method is much more difficult 
than the experimental method, and the results obtained are 
usually less accurate and less satisfying. But they are much 
better than no results at all. 

The classification of scientific methods into those which are
experimental and those which are statistical, like most classifications,
is formal and arbitrary and not entirely realistic. When
the scientist comes to work on a problem in practice, he almost
always combines elements of both the statistical and the experimental
approach. Many of the most important of the statistical
methods were originated in the fields of physics and astronomy— 
fields that we usually think of as “exact” sciences. Even in 
these fields the scientist has to contend with errors of observation, 
and in addition he usually finds it impossible to record the values 
of all the variables which are involved.¹ Under such circumstances
the “exact” scientist is forced to combine statistical
methods with his experimental procedures. On the other hand, 
even the social scientist can and does use a certain amount of 
control in his investigations. 

1.4. Statistics. —The word “statistics” is used in two senses 
which differ materially. It is sometimes used in the singular 
and sometimes in the plural. When used in the plural, it refers 
to numerical data. Thus if we say that there are statistics 
in the “World Almanac” we mean that there are numerical 
data there. When we use the word in the singular, we refer to a 
body of methods which are used in summarizing such numerical 
data. Statistics is a body of methods which are used when we 
wish to study masses of numerical data and to extract from them 
a few simple facts. 

Originally statistics were gathered for public purposes. In 
fact, the word “statistics” and the word “state” come from the 
same root. Statistics were gathered for purposes of taxation 
and for military purposes. But nowadays there is almost no 
field in which statistics are not useful. Every science depends 
to some extent on the gathering of data and on the application 
to them of statistical methods. In some fields, as has been 
pointed out above, the statistical methods are almost the only 
methods that can be used, while in other fields they are a minor 
supplement to other scientific methods. 

1.5. Preliminary Admonitions. —It is important to remember 
that the purpose of statistical method is to simplify great bodies 
of numerical data. If you were shown a table containing 1000 
figures, each figure representing the weight of a newborn baby, 
you would be confused by the very mass of material itself. 

¹ See Chap. II for a discussion of these problems.

But if these 1000 figures could be boiled down to one or two,
you would comprehend them quickly. Thus if we discover 
that the average weight of girl babies at birth is 7.1 lb. and of 
boy babies 7.6 lb., we have derived two figures from the original 
1000 and have made much simpler the problem of understanding 
the original data. To be sure, we must not be misled into believing
that the original data were so simple as our conclusions. We
must not come to the conclusion that each girl baby weighed
7.1 lb. and that each boy baby was just ½ lb. heavier. We
have given up some of the detail of our original figures in order 
that we may get a simple, convenient, and easily understood 
general statement. 

It is the purpose of statistical methods thus to simplify data. 
In too many cases students who have studied but a little statistics 
lose sight of this fact and come to believe that the purpose of 
statistics is to mystify the uninitiated. They try, by the use of 
uncommon terms and symbols, to impress the layman with 
their own erudition. Such an attitude shows complete lack of 
comprehension on the part of the student. Unless data are 
simpler and easier to understand after statistical methods have 
been applied than they were before, the time and trouble of 
applying the methods have been wasted. If statistical methods 
make data more complicated and harder to understand they are 
worse than useless. The student should try, in the case of each 
method discussed in this book, to see just how it makes masses 
of data simpler and more easily understood than they would 
otherwise be. 

It is also wise to caution the student who is beginning the 
study of statistics that statistical methods cannot, in themselves, 
solve any problem. The original data must have been accurate, 
the methods must have been properly applied, and the results 
must have been interpreted by one who understands both the 
methods themselves and the field to which they have been 
applied. Many a student feels that if he takes data—any data— 
and performs certain mystic necromancies, he will get results 
which, by some unknown power, are correct. He has them 
down on paper, in black and white, carried to seven decimal 
places, and hence they must be right. He feels, with Mephisto’s
pupil, Denn was man schwarz auf weiss besitzt, kann man getrost
nach Hause tragen (what one has in black and white one can
confidently carry home). It is important to realize the fact that no
statistical method can, in itself, insure against mistakes, inaccuracy,
or faulty reasoning and incorrect conclusion. These
methods are to be thought of as tools which, when in proper 
hands and when applied to the materials for which they are 
designed, can turn out useful products, but which have no powers 
to work wonders by themselves. 

1.6. Suggestions for Further Reading.—Any textbook on logic will 
describe some of the characteristics of the various scientific methods. In 
addition, there are a number of interesting and instructive books on the 
subject of the scientific method itself. Karl Pearson’s “Grammar of 
Science,” now available in the handy Everyman’s Library, is one of the 
best. For several interesting illustrations of pure chance discoveries, see 
Professor W. B. Cannon’s article, The Role of Chance in Discovery, Scientific
Monthly, Vol. 50, No. 3, March, 1940, pp. 204-209.



CHAPTER II 


THE MEANING OF NUMBERS 

It is impossible for us here to go into the philosophy of number 
theory, nor is it our intent to develop the history of the number 
concept. Before we work with numbers it is important, however, 
to understand just what we do and do not mean when we express 
facts in numerical form. 

2.1. Accuracy of Measurement. —Most of the numbers that we 
use in scientific work represent measurements. These measurements
are made with various kinds of instruments, varying from
such relatively simple things as a foot rule to complicated apparatus
such as that used to measure the speed of light. Yet no
measuring instrument is completely accurate, nor is the operator 
of any measuring device completely dependable. Two men will 
read an instrument in slightly different ways, or the same man 
will read the same instrument in different ways at different 
times. The accuracy of a measurement will depend in part 
on the skill and the carefulness of the person making it. It will 
also depend in part on the instrument used. Some scales will give 
readings to the nearest pound, some to the nearest ounce, some 
to the nearest milligram. Yet even the finest, most delicate, 
most precise of scales has somewhere a limit of accuracy beyond 
which it cannot go. Similarly an instrument for measuring 
lengths may be as crude and inaccurate as the hand lead line used 
by mariners to ascertain the depth of water, which has markings 
at 2, 3, 5, 7, 10, 13, 15, 17, and 20 fathoms. Since a fathom is 
6 ft., one might be able to discover with such a line that the depth
of the water was between 60 and 78 ft. (10 and 13 fathoms), but
for greater accuracy he would have to rely on his ability to estimate
rather than on the instrument used. On the other hand, in
manufacturing many kinds of machinery the tolerances are well 
below one hundredth of an inch, and various ingenious sorts of 
instruments have been developed which will distinguish and 
record lengths far smaller than this. Yet again, even the best
of these measuring instruments has somewhere a limit to its
accuracy. 

This means, of course, that no measurement is ever exactly and 
entirely accurate save by chance. To be sure, I might cast the 
lead line and announce a depth of 14 fathoms. This depth might 
be just exactly correct—yet if so it is not only a matter of chance, 
but I would not have any way of knowing that it was exactly 
correct. When, if ever, a measurement is exactly accurate, we 
do not know it. 

Seemingly different from these cases are those in which numbers
are the result of counting. If I say that there are 10 people
in a room, the figure 10 is supposed to be absolutely exact, and not 
an approximation. Yet in practice even this distinction is more 
apparent than real. The Bureau of the Census announces the 
1950 population of the United States as 150,697,361. This
purports to be the result of a count. Yet everyone knows that 
the result is probably inaccurate. If it is accurate, we have no 
way of knowing it. The census was taken as of Apr. 1, 1950. 
But the population was different at different times of day. Some 
people surely were not counted at all, and some people were 
probably counted twice. Some people doubtless reported the 
size of their families on the day the census taker came rather 
than as it was on Apr. 1. There are many probable causes of 
error even in a count, and when the numbers involved are large, 
we know that these numbers, also, are only approximations. 

Let us start, then, with the understanding that the numbers 
used in scientific work are almost always¹ merely approximations.

2.2. Biassed and Compensating Errors. —The errors in measurement
of which we have just been speaking are of two fundamentally
different kinds. If 50 students each measure the length
of a desk with a foot rule, they will not get the same answers. A 
foot rule is not an accurate measuring device, and while they may 
all agree on the number of whole feet, or even of whole inches, 
when they get down to sixteenths of an inch there is sure to be 
some variation. Such chance errors in measurement, however, 
are subject to the peculiarity that some of the measurements will 

¹ The exceptions are, of course, those cases where we are dealing with
small numbers which are the result of counting, as when a man says he has
12 children. These must also be cases involving discrete rather than continuous
data. These terms are defined on p. 56.

ordinarily be too large and some too small; some of the errors
will be positive and some negative. The errors will tend to 
balance each other, so that if we have measurements enough the 
errors will disappear on the average. This fact has given rise 
to the humorous remark that statisticians like errors, and the 
more errors they get, the better they like it, because the more of 
them there are, the more chance there is that they will balance. 
While this is hardly true, nevertheless we do recognize the fact 
that errors in measurement are inevitable, but that errors arising 
from pure chance tend to cancel each other out.¹ We call such
errors compensating errors, accidental errors, unbiassed errors, or
random errors.

In contrast with errors of the type just described, we have some 
errors which persist, occurring over and over again, and always 
in the same direction. Suppose, for example, that a clock runs 
too slowly. We will not get a correct measurement of time, even 
if we measure a great many hours, since the error will always be 
in the same direction. Suppose that the grocer's scales are out of 
adjustment, so that they always register a pound when 13 oz. is 
put on the pan. In such a case it will do no good to weigh a great 
many pounds of sugar with the idea that the errors will balance 
and cancel out. The errors are all in the same direction. It is 
said that women are prone to understate their ages. If this is so, 
a sociological investigation that includes a question relative to 
women's ages will not be self-correcting even if thousands of 
women are asked to state their ages. If a foot rule is too short, 
being in fact only 11.5 in. long even though it is marked 12 in., we 
will not get correct measurements, even if we take the average 
of thousands of measures. Poorly adjusted or incorrectly made 
measuring instruments are often the cause of such biassed error,
systematic error, constant error, persistent error, or cumulative error.
These errors tend to occur always in the same direction, and large 
numbers of measurements do not help in their case. 
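
The contrast between compensating and biassed errors can be illustrated with a small simulation. The sketch below is not part of the original discussion. It assumes a hypothetical desk 60 in. long, random errors drawn from a normal distribution with a standard deviation of 0.25 in., and a constant bias of 0.5 in.; these values and the function name simulate_measurements are illustrative assumptions only.

```python
import random
import statistics


def simulate_measurements(true_value, n, random_sd=0.0, bias=0.0, seed=0):
    """Return n simulated readings of true_value.

    random_sd -- spread of the compensating (random) error, assumed normal
    bias      -- constant (systematic) error added to every reading
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_value + bias + rng.gauss(0.0, random_sd) for _ in range(n)]


TRUE_LENGTH = 60.0  # hypothetical desk length in inches

for n in (10, 1000, 100000):
    unbiased = simulate_measurements(TRUE_LENGTH, n, random_sd=0.25)
    biased = simulate_measurements(TRUE_LENGTH, n, random_sd=0.25, bias=0.5)
    # Print the error of the average reading for each kind of instrument.
    print(
        n,
        round(statistics.mean(unbiased) - TRUE_LENGTH, 4),
        round(statistics.mean(biased) - TRUE_LENGTH, 4),
    )
```

As the number of readings grows, the error of the unbiased average should shrink toward zero, while the error of the biased average should settle near the bias itself, however many readings are taken.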

Of course, systematic errors do not always arise from faulty 
instruments. They may arise from lack of skill in reading the 
instrument. It has long been known, for example, that when 
astronomers watch a star very carefully through a transit 
instrument, intending to record the exact time that it crosses the 

¹ The errors tend to be distributed in what we shall call a “normal distribution”
in Chap. VII.

cross hair in the instrument, some observers tend always to record
the transit just before it really takes place, some just after it 
really takes place, and some at approximately the right time. 
Some individuals thus have a biassed error in one direction and 
some in another, while others seem to have compensating errors. 
A popular magazine reported a few years ago¹ that when a large
number of people were asked to estimate the length of a minute 
by ringing a gong at the beginning and end of their estimates their 
average guess was only 35 sec. If all or most of these people 
tended to err in the same direction, the error was a biassed one, as 
it very evidently was in this case. Bias may come from the 
unwillingness or the inability of people to give correct information,
or from peculiar individual traits which lead an observer to
read a scale incorrectly but always too high or too low. Biassed 
error can sometimes be discovered and eliminated. Random 
error can never be eliminated, although its effects may be reduced 
by getting large numbers of observations. 

2.3. Significant Figures. —In abstract arithmetical work when 
one uses the number 15 he means just exactly 15—no more and 
no less. The number 15 and the number 15.0000 are assumed to 
mean the same thing. But we have just learned that in scientific 
work a number is seldom exact. When a scientist uses the 
number 15 he means “approximately 15.” The convention 
which has been adopted in all the various fields of scientific work 
is that the scientist will write down as many digits as he knows, 
and then add zeros enough to locate the decimal point. Usually 
the last digit other than zero is an approximation which is correct 
to the nearest place. When an elementary physics book gives 
the speed of light as 186,000 miles per second, it is understood 
that the digits 1, 8, and 6 represent measurements, although the 
last digit, 6, is probably an approximation. The three zeros are 
not measurements at all. They are put there merely so that we 
shall know the position of the decimal point. In fact, Newcomb
and Michelson’s² determination of the speed of light is 186,324
miles per second. It is at once evident that the three zeros in 
the number 186,000 were not measured zeros. They merely 
indicated that the measurement was in thousands of miles 
rather than in miles. And even the more accurate figure 186,324 

¹ Collier's Magazine, Vol. 108, No. 4, July 26, 1941, p. 8.

² “American Ephemeris and Nautical Almanac,” 1940, p. xx.

is probably not exactly correct. It means that the speed lies
nearer to 186,324 than to 186,323 or 186,325. Thus we know 
only that the speed lies between 186,323.5 and 186,324.5 miles 
per second. The statement that the speed of light is 186,000 
miles per second is taken by the scientist to mean, not that it is 
exactly 186,000 miles, but nearer to 186,000 than to 185,000 or 
187,000. It is correct to the nearest thousand miles. It lies 
between 185,500 and 186,500 miles. 
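
The range implied by a rounded figure can be worked out mechanically: it extends half a unit of the last significant place on either side of the stated value. The following is a minimal sketch under that assumption. The function name implied_interval and the use of Python's decimal module (chosen to avoid binary rounding in the printed limits) are illustrative choices, not anything prescribed by the text.

```python
from decimal import Decimal


def implied_interval(stated, unit):
    """Return (low, high) for a figure that is correct to the nearest `unit`."""
    stated = Decimal(str(stated))
    half_unit = Decimal(str(unit)) / 2
    return stated - half_unit, stated + half_unit


# Figures taken from the discussion of the speed of light and of 230 vs. 230.00.
examples = [(186000, 1000), (186324, 1), (230, 10), ("230.00", "0.01")]

for stated, unit in examples:
    low, high = implied_interval(stated, unit)
    print(f"{stated} to the nearest {unit}: between {low} and {high}")
```

The unit must be supplied by the user, because the written figure alone does not always show which zeros were measured.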

As we change the form of our statement, increasing its accuracy,
we could give the speed of light successively as 190,000;
186,000; 186,300; 186,320; and 186,324 miles per second. Each 
time we get a little closer to the fact, but we never get the exact 
measurement save by chance. As we get more and more accurate
in our statement, coming progressively closer to the exact
figure without ever getting there, we say that we get more and 
more significant figures. 

Webster's Dictionary defines significant figures as “figures that
remain to a number or decimal after the ciphers at the right or left 
are canceled.” Thus the number 1000 has one significant figure; 
the number 900 has one significant figure; the numbers 910 and 
912 have, respectively, two and three significant figures. The 
following numbers have the number of significant figures indicated,
in conformity with the rule given in Webster’s definition:

Number      Significant Figures

200         1
20          1
2           1
0.2         1
0.02        1
0.002       1
210         2
217,352     6

Webster’s definition, however, falls down in some cases. The 
number 321.4500 would have five significant figures if we canceled
the zeros at the extreme right. But these zeros were not
necessary to locate the decimal point. To be more accurate we 
could say: 

1. Every digit except zeros is always significant. 

2. Zeros are always significant unless:

   a. They are at the extreme right of a number and to the left of the
      decimal point.

   b. They are at the extreme left of a number.

Thus, in the number 32,056, the zero is not at the extreme 
right or the extreme left, and it is therefore significant. The 
number has five significant figures. In the number 230.00, the 
zeros are at the extreme right, but they are not at the left of 
the decimal point. Hence they are significant, and the number 
has five significant figures. In the number 186,000, the zeros 
are at the extreme right and at the left of the decimal point, so 
they are not significant, and the number has three significant 
figures. In the number 0.003, the zeros are at the extreme left 
and are not significant. The number has one significant figure. 
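
The two rules can be applied mechanically to a number written out as text. The sketch below is one possible reading of them and is offered only as an illustration. The function name count_significant_figures is hypothetical, and the input is assumed to be a plain numeral such as "186,000" or "0.003", with optional commas and at most one decimal point.

```python
def count_significant_figures(number_text):
    """Count significant figures in a written numeral, following rules 1 and 2."""
    text = number_text.replace(",", "").lstrip("+-")
    if "." in text:
        integer_part, fraction_part = text.split(".")
        digits = integer_part + fraction_part
        # Rule 2b: zeros at the extreme left are not significant.
        # Zeros at the extreme right lie to the right of the decimal
        # point here, so rule 2a does not apply and they are kept.
        return len(digits.lstrip("0"))
    # No decimal point is written, so all digits lie to its left.
    digits = text.lstrip("0")       # rule 2b
    return len(digits.rstrip("0"))  # rule 2a


for numeral in ("32,056", "230.00", "186,000", "0.003"):
    print(numeral, count_significant_figures(numeral))
```

Because the function sees only the written numeral, it cannot tell when a final zero to the left of the decimal point was in fact measured; that information has to come from the context or from the way the number is written.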

We can state the rule in another form by saying that all digits 
are significant figures except zeros which had to be included to 
show the location of the decimal point. In the examples in the 
preceding paragraph, it is clear that the zero in the number
32,056 was a measured zero, and was not put in to locate the
decimal point. 

Zeros at the extreme left of a number are always insignificant, 
since there is no other reason for using them than to locate the 
decimal point. But with the number 230.00, the final zeros did 
not have to be put in to locate the decimal point. They should 
not be put in at all unless they represent measurements. The 
number 230 and the number 230.00 do not mean the same thing 
in scientific work, even though in pure mathematics they are the 
same. In science the number 230 means that a measurement lies 
between 225 and 235; while the number 230.00 means that the 
measurement lies between 229.995 and 230.005. It is obvious 
that the figure 230.00 represents a far more accurate measurement 
than the figure 230. The convention in scientific circles is to 
write only as many digits as are known to be correct, adding 
enough zeros to locate the position of the decimal point if its 
position would not otherwise be evident. Digits so written (not 
including the zeros added merely to locate the decimal point) are 
called significant figures. 

The student sometimes feels that the number 0.000324 must 
represent a very accurate measurement on account of the zeros 
which precede it. We have said that these zeros are not significant.
Suppose you have measured a distance and found it to be,
as nearly as you can tell, 324 mm. You can express this measurement
also by saying that it is 0.324 m. or that it is 0.000324 km.
You have not increased the accuracy of the measurement by 
using larger units (kilometers instead of meters). You have 
expressed exactly the same measurement in three different ways, 
and each of the three numbers 324; 0.324; and 0.000324 has the 
same number of significant figures, namely, three. 

2.4. Standard Notation. —Sometimes, to be sure, we may have 
zeros at the right-hand end of a number when they really are 
significant. Suppose I measure a distance and find that it is 
100 ft. to the nearest foot. I know that it lies between 99.5 and 
100.5 ft. Under these circumstances the two zeros really represent
measurements, yet under our rules they would be called
nonsignificant. Or, of course, one of the zeros might be significant
while the other was not, if I had found the distance to be
between 95 and 105 ft. to the nearest 10 ft. Thus we see that the 
number 100 may have one, two, or three significant figures, 
depending on the actual accuracy of the original measurement. 

Often we can tell whether such zeros are significant or not by 
the context. Suppose, for example, that I am given a column of
figures showing the total amount of $20 bills circulating at the 
ends of various years, as follows: 


Year    Amount
        (millions)

1940    $1,800
1942     4,096
1944     7,224
1946     9,310
1948     8,846
1950     8,529


I note that the entries for 1940 and 1946 appear by our rules to 
have two and three significant figures, respectively; yet it is fairly 
safe to assume that all the numbers are given to the nearest 
million dollars, and that these final zeros are significant. 

When one wishes to show which zeros are significant and which 
are not, it can be done easily by means of what is called standard 
notation. Our system of enumeration is based on the radix 10, 
and every number in our system can be stated in the form of some 
number multiplied by a power of 10. For example, the number 
20 is $2 \times 10$; the number 200 is $2 \times 100$, or $2 \times 10^{2}$; the number
234 is $2.34 \times 100$, or $2.34 \times 10^{2}$. Here are several other numbers,
each stated in the usual form and also in standard notation:

Usual Form     Standard Notation

15,325         $1.5325 \times 10^{4}$
21             $2.1 \times 10^{1}$, or $2.1 \times 10$
2.1            $2.1 \times 10^{0}$, or $2.1 \times 1$
0.21           $2.1 \times 10^{-1}$
0.021          $2.1 \times 10^{-2}$
0.0021         $2.1 \times 10^{-3}$



The student will see at once that when the exponent of 10 is 
positive it means, “move the decimal point so many places to the 
right.” Thus $1.5326 \times 10^{3}$ means 1.5326 with the decimal point
moved three places to the right, or 1532.6. When the exponent
of 10 is negative, it means “move the decimal point so many
places to the left.” Thus $2.345 \times 10^{-4}$ means 2.345 with the
decimal point moved four places to the left, or 0.0002345.
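
The shifting of the decimal point can be tried out directly. The sketch below is only an illustration: it uses the `scaleb` method of Python's standard `decimal` module, and the function name `from_standard` is a hypothetical one chosen for this example.

```python
from decimal import Decimal

def from_standard(mantissa, exponent):
    """Shift the decimal point of `mantissa` by `exponent` places.

    A positive exponent moves the point to the right, a negative one to the left.
    """
    return Decimal(mantissa).scaleb(exponent)

print(from_standard("1.5326", 3))    # 1532.6
print(from_standard("2.345", -4))    # 0.0002345
```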

Now let us go back to the problem that we raised at the beginning
of this section. How can we indicate that one of the zeros is
significant and one is not in the number 100? First we write it 
in standard form, when it could appear in any of the following 
ways: 

$1 \times 10^{2}$

$1.0 \times 10^{2}$

$1.00 \times 10^{2}$

When one thinks of the numbers as pure numbers, these seem to 
be the same. But the first has one significant figure, the second 
has two, and the third has three. We note that the second number
includes the expression 1.0. In this number the zero is significant
under the rules as given on pages 11-12. In the number
1.00 both zeros are significant under our rules. Therefore we see
that when we write $1 \times 10^{2}$ we mean that only the figure 1 is
significant; if we write $1.0 \times 10^{2}$ we mean that the 1 and one
of the zeros are significant; and if we write $1.00 \times 10^{2}$ we mean
that all three digits are significant.
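
A brief sketch shows the same distinction in Python. It assumes the three forms are typed in the E-notation that the standard `decimal` module accepts (so that `1.0E2` stands for 1.0 multiplied by 10 squared); the variable names are illustrative only.

```python
from decimal import Decimal

# Three ways of writing one hundred, differing only in the digits claimed as known.
for text in ("1E2", "1.0E2", "1.00E2"):
    number = Decimal(text)
    digits = number.as_tuple().digits
    print(text, "value:", int(number), "significant figures:", len(digits))
```

All three have the value 100 as pure numbers, yet they carry one, two, and three stored digits respectively.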

To write a number in standard notation (also called scientific
notation), we write the first digit followed by a decimal point, and
we then write such other digits as are significant, finally multiplying
the entire number by whatever power of 10 is necessary to
put the decimal point in its proper place. Under ordinary
circumstances numbers in standard notation have but one digit at
the left of the decimal point. There are, however, two exceptions
to this rule. First, comparison of several numbers is often
easier if they are all multiplied by the same power of 10. Thus
we may say that the mean distance between the sun and the
earth is $9.287 \times 10^{7}$ miles while the mean distance between
Uranus and the sun is $1.782 \times 10^{9}$ miles; but if we had given
the two numbers as $9.287 \times 10^{7}$ and $178.2 \times 10^{7}$ it might have
been more immediately clear that Uranus is roughly 20 times as
far from the sun as is the earth. The distances are to each other
as 9.287 is to 178.2. In the second place we vary our rule for
standard notation when we are expecting to take roots of the
numbers. In such cases we purposely select a power of 10 so
that the index of 10 will be evenly divisible by the root we
are to take. For example, if we want to take the cube root of
$4.625 \times 10^{8}$ we write the number as $462.5 \times 10^{6}$. The cube
root of $462.5 \times 10^{6}$ is $\sqrt[3]{462.5} \times 10^{2}$, which is $7.73 \times 10^{2}$.
Similarly, if we wanted to find the square root of $7.39 \times 10^{5}$ we
would state our number as $73.9 \times 10^{4}$, and the square root would
be $\sqrt{73.9} \times 10^{2}$, or $8.60 \times 10^{2}$. In taking the nth root of a
number we divide the index of 10 by n, and therefore we prefer to
write the number originally in such a form that the index of 10
will be evenly divisible by n.
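
The device for taking roots may be sketched as follows. This is a minimal illustration using ordinary floating-point arithmetic, not a prescribed method; the function name `root_in_standard_form` and its arguments are assumptions made for the example.

```python
def root_in_standard_form(mantissa, exponent, n):
    """Return (root_mantissa, root_exponent) for the nth root of mantissa x 10**exponent."""
    # Move the decimal point to the right until the index of 10 divides evenly by n.
    while exponent % n != 0:
        mantissa *= 10
        exponent -= 1
    # The root of the mantissa is taken directly; the index of 10 is divided by n.
    return mantissa ** (1.0 / n), exponent // n

m, e = root_in_standard_form(4.625, 8, 3)   # cube root of 4.625 x 10^8
print(round(m, 2), e)

m, e = root_in_standard_form(7.39, 5, 2)    # square root of 7.39 x 10^5
print(round(m, 2), e)
```

In both cases the returned index of 10 is 2, in agreement with the worked figures above.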

2.5. Computations with Approximate Numbers.—The rules
that we all learn for the simple arithmetical procedures of addition,
subtraction, multiplication, and division are based on the
assumption that we are using “pure numbers”—numbers that 
are exactly accurate and mean exactly what they say. But when 
we use numbers that are merely approximations, as we almost 
always do in science, these rules are likely to give us misleading 
conclusions. For example, suppose you are to find the average 
of the numbers 7, 4, and 6. By common arithmetical methods 
you would add the three numbers and divide by 3 to get 

$17/3 = 5.666666666666 \ldots$

carrying out the computation to as many 6's as your patience 
would permit. But when we remember that these numbers are 
not accurate, we realize that the number 7 means “somewhere
between 6.5 and 7.5,” the number 4 means “somewhere between
3.5 and 4.5,” and the number 6 means “somewhere between 5.5
and 6.5.” The sum of our three numbers could be as low as
$6.5 + 3.5 + 5.5 = 15.5$; or it might be as large as

$7.5 + 4.5 + 6.5 = 18.5$

In the first case the average would be 15.5/3 = 5.17 approxi¬ 
mately. In the latter case the average would be 18.5/3 = 6.17 
approximately. What is the sense in carrying out the answer as 
5.666666666 . . . when we are not sure of even the first digit? 
The long line of 6’s does not represent actual measurement, and 
the numbers should not be there. They are not significant. 
They are not digits which are really known at all. We see at once 
that when our original figures are more or less inaccurate there is 
no reason to carry out computations to large numbers of decimal 
places which contain only a pretended accuracy. Therefore we 
shall need to use new rules for the simple arithmetical processes to 
adapt them for use with approximate numbers. 
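
The range within which the average may lie can be computed in a few lines. The sketch assumes every reading was rounded to the nearest whole unit, so that each may be off by half a unit either way; the name `mean_bounds` is hypothetical.

```python
def mean_bounds(readings, half_unit=0.5):
    """Lowest and highest mean consistent with readings rounded to the nearest unit."""
    count = len(readings)
    lowest = sum(r - half_unit for r in readings) / count
    highest = sum(r + half_unit for r in readings) / count
    return lowest, highest

low, high = mean_bounds([7, 4, 6])
print(round(low, 2), round(high, 2))   # 5.17 6.17
```

Since even the first digit of the average is in doubt, a long string of 6's after the decimal point tells us nothing.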

2.6. Multiplication and Division of Approximate Numbers.—
When multiplying or dividing two or more approximate numbers 
the following rules should be used: 

1. Round off the numbers with the largest number of significant figures 
until they have but one more significant figure than does that one of the 
numbers which has the smallest number of significant figures. 

2. Multiply or divide the rounded numbers in the usual way. 

3. Round off the answer (product or quotient) until it has no more 
significant figures than has that one of the original figures which contained 
the smallest number of significant figures. 

These rules can best be understood by illustrations. 

What is the area of a table top that measures 72 X 36 in.? 
Both numbers have two significant figures. No preliminary 
rounding off is necessary. We multiply $72 \times 36 = 2592$. We
round off the answer to two significant figures to get 2600 sq. in., 
which we give as the answer. The answer 2592 sq. in. should not 
be given, since it contains a pretended accuracy. When the 
original measurements are given as 72 in. and 36 in., the area may 
be anywhere between 2646.25 sq. in. (as it would be if both 
original figures had the greatest possible values) and 2538.25 
sq. in. (as it would be if both original figures had the smallest 
possible value). To pretend that we know the area to the nearest 
square inch when we do not even know it for certain within 100 
sq. in. is likely to mislead ourselves as well as others. The 
student will notice that even our rules give slightly more significant
figures in the answer than we are sure of, since they give an
answer of 2600 sq. in. while we are not even sure of the second 
digit. If the rules for computation given here seem to the student 
to be rough, approximate, and inaccurate, he may rest assured 
that these rules do not discard any accuracy which really existed, 
but only imaginary accuracy. 
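
The three rules for multiplication can be sketched in Python. The sketch is an illustration under stated assumptions: the measurements are given as text whose digits are all significant, the standard `decimal` module does the arithmetic, and the helper names `round_sig` and `multiply_approx` are ours.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(number, figures):
    """Round a Decimal to `figures` significant figures (ties go to the even digit)."""
    last_place = number.adjusted() - figures + 1
    return number.quantize(Decimal(1).scaleb(last_place), rounding=ROUND_HALF_EVEN)

def multiply_approx(a, b):
    """Multiply two approximate numbers given as text, following rules 1 to 3."""
    a, b = Decimal(a), Decimal(b)
    fewest = min(len(a.as_tuple().digits), len(b.as_tuple().digits))
    # Rule 1: keep one figure more than the roughest number has
    # (a number that is already shorter is merely padded with zeros).
    a, b = round_sig(a, fewest + 1), round_sig(b, fewest + 1)
    # Rules 2 and 3: multiply, then round the product to the fewest figures.
    return round_sig(a * b, fewest)

print(int(multiply_approx("72", "36")))   # 2600
# The smallest and largest areas consistent with the two measurements:
print(Decimal("71.5") * Decimal("35.5"), Decimal("72.5") * Decimal("36.5"))
```

The last line prints 2538.25 and 2646.25, the two extremes quoted for the table top.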

One more example: We measure the circumference of a large 
cylindrical water tower, using a foot rule. Because of the crude¬ 
ness of our apparatus we are not sure of the result to thousandths 
of inches, but in our best judgment the circumference is 530 in.,
the first two numbers only being significant. What is the
diameter of the tower? We look in our textbooks and find that
the diameter can be found by dividing the circumference by
3.14159265358979323846 . . . , or we consult our memories and
recall that we can divide by 3.1416; 3.14; or 3 1/7. Since our
measurement of the circumference was inaccurate (as far as we
can be sure) at the start, it would be foolish to use extremely
accurate values of π in our computation. Our first rule says to
round off the number with the most significant figures. Our
number with the least significant figures, 530, contains two; so we
round off the value of π to three significant figures, one more than
we have in the least accurate figure. This gives us a value of
3.14 for π. We now divide to find


$\frac{530}{3.14} = 168$ in.


We know at once that we do not need to carry the computation 
further, because we want the result to two significant figures only. 
We carry it one more place to discover whether it is nearer 160 or 
170, but we now round off the result and give our final answer as 
170 in. for the diameter. If we want to know the diameter with 
more accuracy, it will not help us to use a more accurate value of π
or to carry our result to more decimal places. The only way we 
can learn the diameter more accurately is to measure the circum¬ 
ference more accurately at the start. 

In multiplying or dividing numbers expressed in standard
notation we follow the foregoing rules for the numbers
themselves, but add the powers of 10 in multiplication or subtract
them in division. For example, if we are to multiply
$(7.56 \times 10^{3}) \times (1.9685 \times 10^{8})$ we first round off the number
1.9685 to four significant figures, and multiply $1.968 \times 7.56$.
This gives 14.87808. Following our earlier rules, we round this
off to three significant figures, giving 14.9. But since our two
original numbers were multiplied by $10^{3}$ and by $10^{8}$, our product
must be multiplied by $10^{3+8}$ or by $10^{11}$. We could express the
answer as $14.9 \times 10^{11}$, but to get it in the usual standard form
we write it instead as $1.49 \times 10^{12}$, with but one digit at the left
of the decimal point. Similarly the student will discover that
$(1.765 \times 10^{3}) \times (3.4 \times 10^{-6})$ gives $6.0 \times 10^{-3}$. This final zero
is significant, and if we were to translate our answer from standard
into ordinary notation we should write it as 0.0060 rather
than as 0.006.

In division we divide the numbers themselves according to 
the usual rules, but subtract the index of 10 in the divisor from 
the index of 10 in the dividend. Thus if we have $(8.592 \times 10^{8})/(9.3 \times 10^{6})$
we divide 8.59 by 9.3 to get 0.923, which we round
off to 0.92. Noting that the indices of 10 are 8 and 6, we subtract
the latter from the former and learn that the index of 10 in our
answer should be 2. So we can write our answer $0.92 \times 10^{2}$.
Now to follow our original rule we convert this to $9.2 \times 10$,
which is our final answer. The student may check the fact that
$(3.24 \times 10^{-6})/(1.5 \times 10^{-9})$ gives $2.2 \times 10^{3}$, since

$(-6) - (-9) = 3$
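
The same rules may be sketched for numbers typed in standard notation. The sketch assumes the E-notation of Python's standard `decimal` module (`8.592E8` for 8.592 multiplied by the eighth power of 10), which keeps track of the index of 10 for us; the helper names `round_sig` and `combine` are illustrative, not part of any library.

```python
import operator
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(number, figures):
    """Round a Decimal to `figures` significant figures (ties go to the even digit)."""
    last_place = number.adjusted() - figures + 1
    return number.quantize(Decimal(1).scaleb(last_place), rounding=ROUND_HALF_EVEN)

def combine(a, b, op):
    """Multiply or divide two approximate numbers given as text in E-notation."""
    a, b = Decimal(a), Decimal(b)
    fewest = min(len(a.as_tuple().digits), len(b.as_tuple().digits))
    a, b = round_sig(a, fewest + 1), round_sig(b, fewest + 1)
    return round_sig(op(a, b), fewest)

product = combine("1.765E3", "3.4E-6", operator.mul)
quotient = combine("8.592E8", "9.3E6", operator.truediv)
print(product)    # 0.0060
print(quotient)   # 92, that is, 9.2 x 10
```

The significant final zero of the product survives in the printed result.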

2.7. Addition and Subtraction of Approximate Numbers.—The
rules for addition and subtraction of approximate numbers are
based, not on the significant figures, but on the decade, or number 
of the column counting from the decimal point. Suppose we are 
told that the distance from New York to Chicago is 900 miles and
10 miles the other side of Chicago we come to a fork in the road.
We are to take the right fork and proceed 150 ft., where we find 
a house. We walk up the front walk to the door, a distance of 
38 ft. There is a table 7 ft. 2 in. from the front door. On the 
table, 0.32 in. from the edge, is a box. How far is the box from 
New York City? We see that it would be foolish to add together 
910 miles, 150 ft., 38 ft., 86 in., and 0.32 in.—even though each 
of the measurements contains two significant figures. If we know 
the distance from New York to Chicago only to the nearest 10 
miles, we cannot start adding inches and make sense. Therefore 
our rules for addition and subtraction of approximate numbers are:

1. Arrange the figures in a column, all in the same units (all in feet or 
miles or inches, for example) with the decimal points over one another. 

2. Find the column containing one or more nonsignificant figures which 
is farthest to the left. 

3. Round all the other figures off so that their last significant figure is in 
this column. 

4. Add or subtract in the usual way, using the rounded figures. 

5. Round off the answer (sum or difference) so that it has its last signifi¬ 
cant figure as far to the left as that one of the original numbers whose last 
significant figure was farthest to the left. 

This sounds a good deal more complicated than it is. Again we 
illustrate. 

It is estimated that the number of immigrants coming to the 
United States between the close of the Revolutionary War and 
1820 was 250,000. From 1820 to 1900 the number coming was 
19,123,606, and from 1901 to 1940 the number was 19,166,837. 
What was the total number of immigrants from the Revolution¬ 
ary War to 1940? Familiar methods would lead one to add the 
three numbers and find a total of 38,540,443. A moment's 
thought will show us, however, that we are not justified in doing 
this. Our first number is an estimate. Possibly the number of 
immigrants in this first period was really 250,001, or maybe it was 
251,395. The four zeros in the number 250,000 are not supposed 
to represent exact measurement. They represent some unknown 
numbers, which would take their places if we knew the actual 
facts. So if we set up our numbers for addition in the usual 
manner, we might substitute question marks for these zeros to 
show that we do not really know the values in those columns. 
This would give us the following problem in addition: 

    25?,???
19,123,606
19,166,837

This is like being told to add 6 and 7 to some unknown number. 
The answer will also be unknown. Therefore we follow the rules 
just given. We round off the two latter numbers until they have 
their last significant digit in the “thousands column,” since the
nonsignificant digit farthest to the left (the first zero in 250,000) 
is in this column. This gives us our problem in this form: 
   250,000
19,124,000
19,167,000
----------
38,541,000

This gives us an answer of 38,541,000 instead of an answer of 
38,540,443. The latter had an unwarranted and pretended 
accuracy in its final digits. Even our rule carries us one column 
further than we are sure of, since in the thousands column we 
added 7 and 4 to an unknown number. Hence we now round 
off the answer until its last significant figure is in the “10-thou- 
sand column,” since in our original numbers we had one case 
where the last significant figure was in this column. This gives 
us a final answer of 38,540,000 immigrants. If we state our rule 
loosely (and the meaning should now be understandable), we can 
say that we first round off to one more column than is really 
known, and then add or subtract. We then round off our answer 
to the last column that is really known. 
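
The rule for addition can be sketched as follows. The sketch assumes each number is entered so that every stored digit is significant, which means the estimate of 250,000 is typed as `2.5E5` (two significant figures); the function name `add_approx` is hypothetical, and the arithmetic is done with Python's standard `decimal` module.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def add_approx(numbers):
    """Add approximate Decimals whose stored digits are all significant."""
    # Column (power of 10) holding the last significant figure of the roughest number.
    roughest = max(n.as_tuple().exponent for n in numbers)
    guard = Decimal(1).scaleb(roughest - 1)   # one column further to the right
    rounded = [
        n.quantize(guard, rounding=ROUND_HALF_EVEN)
        if n.as_tuple().exponent < roughest - 1 else n
        for n in numbers
    ]
    total = sum(rounded, Decimal(0))
    # Finally round the sum to the last column that is really known.
    return total.quantize(Decimal(1).scaleb(roughest), rounding=ROUND_HALF_EVEN)

immigrants = [Decimal("2.5E5"), Decimal("19123606"), Decimal("19166837")]
print(int(add_approx(immigrants)))   # 38540000
```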

It is quite possible to lose many, or even most, of our significant 
figures in the process of subtraction. For example, let us ask 
how many more immigrants came to the United States from 1901 
to 1940 than came from 1820 to 1900. The original figures are 
given in the preceding paragraph, and we subtract as follows: 

  19,166,837
- 19,123,606
------------
      43,231

We started with two numbers, each containing eight significant 
figures. Our difference contained but five significant figures. 

2.8. A Horrible Example.—Suppose you are asked to find, as accurately 
as you can, the weight of the earth. You look up the necessary original 
data and find the following: 

a. The volume of a sphere is $4.1888r^{3}$ where r is the radius.

b. Estimates of the polar radius of the earth vary from 6,356,079 to 
6,356,992 m. 

c. Estimates of the equatorial radius of the earth vary from 6,377,397 
to 6,378,388 m. 

d. A meter is 3.28 ft.

e. The density of the earth is 5.5 times that of water.

f. Water weighs 62.5 lb. per cubic foot.

All these figures are approximations, and from them you wish to ascertain 
the weight of the earth. You decide that you will use as the radius the 
average of the four figures given in b and c, which gives you a value of
6,367,214 m. If you follow the rules of arithmetic, forgetting the rules 
that we have just learned, your computations will be as follows: 

1. Cube of radius is 258,136,859,576,097,196,344 cu. m.

2. Volume is found, by multiplying (1) by 4.1888: 

1,081,279,488,591,955,926,046.7472 cu. m. 

3. Each cubic meter contains $3.28^{3}$ cu. ft., or

35.287552 cu. ft. 

4. Volume of earth in cubic feet is product of (2) and (3): 

38,155,706,190,222,051,486,759.9066988544 cu. ft. 

5. From e and f above, a cubic foot weighs

343.75 lb. 

6. Weight of earth is product of (4) and (5): 

13,116,024,002,888,830,198,573,717.927731200000 lb. 

If, on the other hand, we use the rules that have been given, we note 
first that the density of the earth is given to but two significant figures, as 
5.5 times that of water. Therefore we round off all our other figures to 
three significant figures and state them in standard notation, thus: 

a. Volume of a sphere is $4.19r^{3}$.

b. Radius of earth is $6.37 \times 10^{6}$ m.

c. A meter is 3.28 ft. 

d. Density of earth is 5.5 times that of water. 

e. Water weighs 62.5 lb. per cubic foot. 

We then carry out our computations, remembering that when we multiply 
two numbers we add the exponents of the figure 10, and when we raise to 
the nth power we multiply the exponent by n. This gives us the following 
steps: 

1. Cube of radius is $258 \times 10^{18}$, or $2.58 \times 10^{20}$

2. Volume of earth is (1) times 4.19, or

$10.8 \times 10^{20}$, or $1.08 \times 10^{21}$

3. A cubic meter contains $3.28^{3}$ cu. ft., or

$3.53 \times 10$ cu. ft.

4. Volume of earth is product of (2) and (3), or

$3.81 \times 10^{22}$ cu. ft.

5. From d and e above, a cubic foot weighs

$3.44 \times 10^{2}$ lb.

6. Weight of the earth is product of (4) and (5):

$13.1 \times 10^{24}$, or $1.31 \times 10^{25}$ lb.

The answer which we get this way, namely, $1.31 \times 10^{25}$ lb., is as accurate
as we can possibly get with data as inaccurate as those with which we 
started. We have saved ourselves a great deal of arithmetic and have not 
misled ourselves with fictitious accuracy. The long, tedious processes of 
ordinary arithmetic gave an answer to the 12th decimal place, which means 
that we pretend to know the weight of the earth down to the last tiny grain
of sand; while as a matter of fact we started with a density of the earth 
about which we knew only that the figure lay between 5.45 and 5.55. The 
correct method of computation can be carried out on a slide rule with 
accuracy as great as is warranted by the original figures. It actually took 
4½ min. to get this answer with the slide rule. The student might try
timing himself on even the single process of cubing 6,367,214, which we took
in both cases to be the radius of the earth, to see how much time is saved 
by following the correct procedure. But it should be emphasized that the 
rules given in this chapter are not intended to save time. Timesaving is just
a pleasant by-product. The rules are used to keep us from giving a purely 
fictitious accuracy to our results. The rules are intended to keep us from 
saying that we know something when we really merely conjecture it. A 
very large part of the data obtained with ordinary measuring instruments 
is inaccurate enough at the start so that we introduce no new inaccuracy if 
we carry out our computations with a slide rule. 
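
The shorter computation can be retraced step by step in Python. This is a sketch only: it assumes the data rounded to three significant figures as listed above, uses the standard `decimal` module in place of the slide rule, and the helper `round_sig` and the variable names are ours.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(number, figures):
    """Round a Decimal to `figures` significant figures (ties go to the even digit)."""
    last_place = number.adjusted() - figures + 1
    return number.quantize(Decimal(1).scaleb(last_place), rounding=ROUND_HALF_EVEN)

radius_m = Decimal("6.37E6")
cube = round_sig(radius_m ** 3, 3)                              # step 1
volume_cu_m = round_sig(cube * Decimal("4.19"), 3)              # step 2
cu_ft_per_cu_m = round_sig(Decimal("3.28") ** 3, 3)             # step 3
volume_cu_ft = round_sig(volume_cu_m * cu_ft_per_cu_m, 3)       # step 4
lb_per_cu_ft = round_sig(Decimal("5.5") * Decimal("62.5"), 3)   # step 5
weight_lb = round_sig(volume_cu_ft * lb_per_cu_ft, 3)           # step 6

print(weight_lb)   # 1.31E+25
```

Every intermediate result is rounded back to three figures before the next step, just as one would read it from the slide rule.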

2.9. Rounding Off Numbers.—In rounding off numbers the 
accepted practice is to leave the last digit retained just as it is if 
the quantity rounded off amounts to less than half of one of the 
units retained, but to increase by one unit the last digit retained 
if the quantity rounded off exceeds half of one of the units 
retained. If the quantity rounded off amounts to exactly one- 
half a unit, the convention is to leave unchanged the last digit 
retained if it is even, but to increase it by unity if it is odd. This 
means that the last unit of the rounded number is left even if it 
was even or is made even if it was odd—both times, of course, if 
the quantity rounded off is exactly half a unit. The purpose 
of these rules is to increase the number kept half the time and 
leave it unaltered half the time, on the theory that such a pro¬ 
cedure will tend to balance the positive and the negative errors 
on the average over any large number of cases. We can illustrate 
the rules as follows: 

Round off each of the following numbers to three significant figures: 

236,941 is rounded off to 237,000 since the part rounded off (941) is 
greater than one-half (500). The last unit kept is in the thousands, and 
941 exceeds half a thousand. 
236,241 is rounded off to 236,000 since the part rounded off (241) is less
than half a thousand, and the last number kept is in the thousands. 

2,365,001 is rounded off to 2,370,000 since the last significant figure is in
the 10 thousands and the part rounded off (5,001) exceeds half of 10,000.

236,500 is rounded off to 236,000. The part rounded off (500) is just
half a unit, so we leave it even.

235,500 is also rounded off to 236,000. The part rounded off is exactly 
half a unit, so we make the uneven number even by increasing it one unit. 

299,700 would be rounded off to $3.00 \times 10^{5}$. If we do not use standard
notation we get 300,000 when we round it off, yet obviously some of the 
zeros are significant. We were rounding to three significant figures, so we 
gave the 3 and the two zeros, writing in standard form. 
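
The convention of rounding an exact half to the even digit is available in Python's standard `decimal` module as `ROUND_HALF_EVEN`. The sketch below applies it to numbers of the kind just illustrated; the helper name `round_sig` is an assumption made for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(number, figures):
    """Round a Decimal to `figures` significant figures (ties go to the even digit)."""
    last_place = number.adjusted() - figures + 1
    return number.quantize(Decimal(1).scaleb(last_place), rounding=ROUND_HALF_EVEN)

for text in ("236941", "236241", "2365001", "236500", "235500"):
    print(text, "->", int(round_sig(Decimal(text), 3)))

# Standard notation keeps the significant zeros visible.
print(round_sig(Decimal("299700"), 3))   # 3.00E+5
```

The two exact halves, 236,500 and 235,500, both come out as 236,000, one by being left even and the other by being made even.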

These rules are not followed invariably in statistical computa¬ 
tion, especially when the use of computing machines makes it as 
easy to carry a result to 10 as to 2 places. Also, for reasons that 
will become evident later, it is sometimes necessary to carry out 
intermediate computations to several decimal places in order to 
ensure the desired accuracy of final conclusions. The student 
should always be on his guard, however, against “accuracy” that
is purely fictitious, and the rules given above should be kept con¬ 
stantly in mind as guides. 1 

2.10. Suggestions for Further Reading.—The student will probably learn 
more about this subject by working out many examples than he will by 
further reading. C. H. Richardson, “An Introduction to Statistical 
Analysis,” Harcourt, Brace and Company, Inc., New York, 1935, gives a
good and simple treatment in his introductory chapter. A brief but faulty 
statement appears in the 13th edition of the Encyclopaedia Britannica 
(also in the 11th and 12th editions) in the article on arithmetic, section VII 
on approximation, subsection 82 on degree of accuracy. It would be good
practice for the student to discover for himself the error in the definition of
significant figures there given. The last section of Chap. V in E. F. Lindquist's
“A First Course in Statistics: Their Use and Interpretation in Education
and Psychology,” Houghton Mifflin Company, Boston, 1938, gives a
good brief discussion. David Brunt, “The Combination of Observations,”
Chap. 1, Cambridge University Press, London, 1931, discusses accidental
and constant errors, with comments on errors caused by measuring instruments
and those caused by the observer. He also points out (with proof)
that the arithmetic mean of a number of observations is more accurate
than the observations themselves. In general, perhaps the best treatment
of the subject is found in books in the field of physics or astronomy which
deal with the problem of measurement. See, for example, William Chauvenet,
“A Manual of Spherical and Practical Astronomy,” Vol. II, Appendix,
J. B. Lippincott Company, Philadelphia; or Dascom Greene, “An Introduction
to Spherical and Practical Astronomy,” Appendix, Ginn and Company,
Boston. A splendid treatment appears in Willford I. King's “The Elements
of Statistical Method,” Chap. VIII, The Macmillan Company, New York,
1922.

1 It can be shown with little difficulty that one can expect, on the average,
somewhat more accuracy in the arithmetic average of a set of numbers than
there is in the original numbers themselves. This can be shown either
empirically by finding the average of a set of numbers, rounding off some of
the known digits, and comparing the average of the rounded numbers with
the average of the original numbers, or on a priori grounds. We should,
on the average, be able to carry one more significant figure in our average
than there is in the original figures if we take the average of 10 numbers, two
more significant figures if we take the average of 100 numbers, three more
significant figures if we take the average of 1000 numbers, etc. If we have
the weights of 100 people, each weight to the nearest pound, we can theoretically
carry the average weight to the nearest hundredth of a pound.
In practice, if we are dealing with large numbers of cases, it is probably safe
to carry the average to one more significant figure than the original figures.
For an explanation of this fact, see Raymond Pearl, “Medical Biometry and
Statistics,” 2d ed., pp. 362ff., W. B. Saunders Company, Philadelphia, 1930.


EXERCISES 

1. How many significant figures are there in each of the following numbers: 
2200; 134.6; 0.00054; 19,000.00; $1.300 \times 10^{8}$?

2. Multiply 345.982 by 13.6, assuming that both are approximate 
numbers. 

3. The value of π has been computed to several hundred decimal places.
How would you decide, in the case of any actual problem, how many places 
to use? 

4. What answer would you give to the critic who says that rounding off 
of original figures and of answers makes conclusions inaccurate? 

5. Round off each of the following numbers until it has two significant
figures: 3456.7; 0.0009460; 1821; 1871; 18,501; 19,500; 18,500; 19,999. 

6. A distance has been measured as 540,000 ft. correct to the nearest 
10 ft.; that is, it is known that the true distance is between 540,005 and 
539,995 ft. Write the measurement in such a way as to make it clear just
how accurate the number is. 

7. When dealing with pure numbers you have been taught the “table of 
nines” as 9 X 1 = 9, 9 X 2 = 18, 9 X 3 = 27, etc. Write the “table of
nines” as it would appear if the numbers involved were approximate 
measurements. 

8. Suppose a man is asked how many people attended a boxing bout last 
night. He reasons as follows: “The parking lot must be about 2½ acres in
size. It was about three-quarters full. I suppose you can get about 500 or 
600 cars to the acre—call it 500 to be safe. And suppose there were two 
people to the car. That would make 1875 people in attendance.” Com¬ 
ment on his answer. Assuming that his original figures are reasonable, what 
would you give as the answer? 

9. I own a rectangular building lot. The frontage on the street has been 
surveyed and found to be 97.53 ft. The depth of the lot has not been 
accurately measured, but I paced it as 30 paces. My pace is approximately
5 ft. What is the area of the lot? 

10. In an old graveyard we find a tombstone “Sacred to the Memory of 
Garland Waggoner, died August 14, 1731, aged 89 years.” In another 
plot we find another stone “Sacred to the Memory of Howard D. Newton, 
who departed this life August 14, 1731, aged 74 years, 8 months, and 7 days.” 
How much older was Waggoner than Newton? 

11. Multiply 3.47 X 10* by 6.91 X 10"* 

Divide 1.469 X 10“ 8 by 3.62 X 10“« 

Add 3.141 X 10*; 5.55 X 10 3 ; 9.65 X 10“"®; and 4.27 X 10* 

Multiply 136,385 by 7.2 X 10 s 

12. Write each of the eight numbers of Exercise 5 in standard notation. 

13. Dr. Thornton Page of the Yerkes Observatory estimates 1 that there is 
in the universe an average of but about .000,000,000,000,000,000,000,000,- 
000,01 (decimal point, 28 zeros and the number 1) gram of matter per cubic 
centimeter of space. 

a. Write this figure in standard notation. 

b. There are $4.16 \times 10^{15}$ cu. cm. per cubic mile; and there are $9.1 \times 10^{5}$
grams per ton. Convert Dr. Page’s figure to the equivalent number
of tons per cubic mile. 

1 Quoted in Science News Letter , Vol. 59, No. 26, June 30, 1951, p. 411. 
THE FREQUENCY DISTRIBUTION

3.1. The Frequency Table. —The statistician usually works 
with large numbers of data. Originally, of course, these data are 
in the form of individual measurements. For example, the 
following figures are the marks received by 90 students on an 
examination in elementary economics, the highest possible mark 
being 208:


104   57   85  203  128  121   81  105  107  100
166  109  138   75  114   75  118  109  101   81
 65  143  102  107  157  149   94  165  151  181
 49  158   95  206   55   81  191  142   85   82
114   79   81  136  133  122   76  103  158   43
159  150   88  176  133  153   89   89  156  112
136   92  106  112   90  119  156   82   84  163
147  179  123  104   85  131   73  107  164  158
168   93  154  102  112   69  139  142  113  147


Even here, where we have but 90 figures, the impression 
received by inspecting the data is not sharp and clear-cut. 
Moreover, this method of listing the data takes much room. 
Hence statisticians usually condense results of this kind into 
more usable form. For example, they might make a table 
showing the number of times each mark occurred. This would 
appear like Table 3.1. 

Here we have the advantage that the figures have been 
arranged in order of magnitude, but we still have too many 
entries for easy comprehension. These data would usually be 
condensed even more, as in Table 3.2. 

Table 3.1.—Marks Received by 90 Students on an Examination in
Elementary Economics. Highest Possible Mark, 208


Mark  Number    Mark  Number    Mark  Number    Mark  Number

206     1       151     1       113     1        84     1
203     1       150     1       112     3        82     2
191     1       149     1       109     2        81     4
181     1       147     2       107     3        79     1
179     1       143     1       106     1        76     1
176     1       142     2       105     1        75     2
168     1       139     1       104     2        73     1
166     1       138     1       103     1        69     1
165     1       136     2       102              65     1
164     1       133     2       101              57     1
163     1       131     1        95     1        55     1
159     1       128     1        94     1        49     1
158     3       123     1        93     1        43     1
157     1       122     1        92     1
156     1       121     1        90     1       Total  90
154     1       119     1        89     2
153     1       118     1        88     1
152     1       114     2        85     3




Table 3.2. —Marks Received by 90 Students on an Examination in 
Elementary Economics. Highest Possible Mark, 208 


Mark       Number of Cases    Mark       Number of Cases

200-209           2           110-119           8
190-199           1           100-109          14
180-189           1            90- 99           5
170-179           2            80- 89          13
160-169           5            70- 79           5
150-159          11            60- 69           2
140-149           6            50- 59           2
130-139           7            40- 49           2
120-129           4           Total            90


Of course, it is not necessary that we group the marks in classes 
of 10. We might choose to group them in classes of 50, in which 
case we should have Table 3.3. 

It will be noticed immediately that as we group the data in 
larger and larger classes we gain simplicity but lose detail. In 
neither Table 3.2 nor Table 3.3 do we know a single mark that
was assigned. We cannot tell whether or not anyone received a
mark of 102. To be sure, we know that 39 students received 
marks from 100 to 149, but whether any one or all of them 
received the mark of 102 is not stated. We have condensed our 
data and made it easier to get an idea of the distribution of marks 
received by this class. But we have done it at the cost of 
exactitude.

Table 3.3.—Marks Received by 90 Students on an Examination in
Elementary Economics. Highest Possible Mark, 208

   Mark     Number of Cases
 200-249           2
 150-199          20
 100-149          39
  50- 99          27
   0- 49           2
 Total            90


Let us note, first, several things about the form of these tables. 
When data are arranged as these are, so that we are told the 
number of times that each of various values occurs, we say that 
we have a frequency table. Frequency tables may give each value 
that occurs and tell the number of times that it occurs, as does 
Table 3.1, page 27. More commonly they divide the data into 
classes and show the number of cases that fall within the limits of 
each class. This is the form in which Tables 3.2 and 3.3 appear. 
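
The grouping that turns a list of marks into a table like Table 3.2 or 3.3 can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` is our own, the classes are assumed to begin at multiples of the class width, and the ten marks used are merely the first few of the 90 listed above. Classes that contain no cases are simply not printed.

```python
from collections import Counter

def frequency_table(values, width):
    """Count how many values fall in each class of the given width.

    Classes are assumed to start at multiples of `width`
    (80-89, 90-99, ... for width 10). Rows are returned as
    (stated lower limit, stated upper limit, count), largest class first.
    """
    counts = Counter((v // width) * width for v in values)
    return [(low, low + width - 1, counts[low])
            for low in sorted(counts, reverse=True)]

# A few of the marks listed above, used only for illustration.
marks = [150, 88, 176, 133, 153, 89, 89, 156, 112, 136]

for width in (10, 50):
    print(f"Classes of {width}:")
    for low, high, n in frequency_table(marks, width):
        print(f"  {low}-{high}: {n}")
```

With classes of 10 the ten marks fall into five classes; with classes of 50 they fall into three, which shows on a small scale the loss of detail that comes with larger classes.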

3.2. Class Limits.—Let us turn our attention now to the numbers
used to denote the classes. In Table 3.3, we find the classes
described as follows: 

200-249 

150-199 

100-149 

etc. 

Each class is bounded by two figures, which are called the class 
limits. The class limits of the first class listed in Table 3.3, for 
example, are the numbers 200 and 249. The larger of these 
numbers (249) is called the upper limit, and the smaller (200) is
the lower limit.

Class limits are not always exactly what they seem on casual
inspection of a frequency table. Suppose, for example, that we 
have been testing samples of rubber bands produced in a certain 
factory to find out how heavy a load each band will carry before 
it breaks; and we find 123 samples to be distributed as in Table 
3.4. 

Table 3.4.—Hypothetical Example of Breaking Points of 123 Rubber Bands

 Breaking Point     Number of
    (pounds)          Cases
     4- 6               5
     7- 9              23
    10-12              68
    13-15              21
    16-18               6


At first sight it would seem that the 68 rubber bands which are 
listed together all broke at weights of 10 or 12 lb. or somewhere 
in between. We should be likely to conclude that the class 
limits are 10 and 12 lb. It would be surprising that we should 
have bands breaking between these points, but none breaking at 
weights between 9 and 10 lb. If the upper limit of one class is 
9 lb. and the lower limit of the next class is 10 lb., we have no 
place to classify weights of 9.3 lb., for example. 

At this point we need to recall from the preceding chapter just 
what these numbers mean. We remember that the number 9 
means “between 8.5 and 9.5,” so when the upper limit of a class
is given as the number 9, the actual limit is 9.5. The lower limit 
of the next class is given as the number 10, but this means 
“between 9.5 and 10.5,” so the actual lower limit is 9.5. Thus 
we see that there is really not a “no-man’s land” between the 
classes. The stated limits are 10 and 12, but the actual limits are 
9.5 and 12.5. 
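
For measurements that have been rounded to the nearest unit, as in Table 3.4, the step from stated limits to actual limits is simple arithmetic. The sketch below is only illustrative; the function name `actual_limits` and the parameter `unit` (the precision to which the data were recorded) are our own, and the rule applies only to data of this rounded, continuous kind.

```python
def actual_limits(stated_lower, stated_upper, unit=1.0):
    """Actual limits of a class whose stated limits are measurements
    rounded to the nearest `unit` (an assumption about the data)."""
    half = unit / 2
    return stated_lower - half, stated_upper + half

# Classes of Table 3.4, recorded to the nearest pound.
for low, high in [(4, 6), (7, 9), (10, 12), (13, 15), (16, 18)]:
    print((low, high), "->", actual_limits(low, high))
```

Each actual upper limit coincides with the actual lower limit of the class above it, so no value is left without a class.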

Yet the actual limits are not always halfway between the 
stated limits. If we had a frequency table showing the numbers 
of families with various numbers of children, the first two classes 
might have the following stated limits:

1-3

4-6

Here it would be incorrect to say that the upper limit of the first
class was really 3.5 children. There are no families with 3.5 
children. We know from the nature of the data that the upper 
limit and the stated limit in this case are the same. We have to 
decide what the stated limits mean by our knowledge of the 
characteristics of the data with which we are working, and it is 
always dangerous for a statistician, no matter how competent 
he may be in technical statistical theory, to work with data that 
he does not understand. 

There is, however, still another sort of case that needs explanation.
Purely for convenience of tabulation, statisticians have
agreed that when the stated upper limits of all classes end with the
digit 9 the upper limit is to be considered as extending clear up
to the lower limit of the next class as stated. For example, we
might restate our hypothetical example of the breaking points 
of rubber bands by transforming the class limits of Table 3.4 into 
the form given in Table 3.5. 

Table 3.5.—Hypothetical Example of Breaking Points of 123 Rubber Bands

 Breaking Point     Number of
    (pounds)          Cases
     4- 6.9             5
     7- 9.9            23
    10-12.9            68
    13-15.9            21
    16-18.9             6


In Table 3.5 each upper limit ends with the digit 9. Hence the 
upper limit of the lowest class, for example, is taken to be, not 
6.9 as stated and not 6.95, halfway between the stated upper 
limit and the stated lower limit of the next class above, but as 
6.999.... In other words, any value as large as 4 but not so
large as 7 would be put in this class, even if it fell short of 7 by an
exceedingly small amount. This method of evaluating class
limits is an exception to the general rule for interpreting the meaning
of numbers—an exception that is made merely because the
statistician, in looking over his original data, can save a good deal 
of time if he knows that every value which begins with 4, 5, or 6 
goes in this class, regardless of the decimal which may follow it. 

In order to make the meaning of class limits even more evident, 
some authors state them thus: 

4 and under 7 
7 and under 10 
10 and under 13 
etc. 

These limits obviously mean exactly the same as those of Table 
3.5. Other writers, in an effort to save time, write these same 
class intervals thus: 

4- 

7- 

10-

In these cases the upper limits are not stated, and it is understood 
that the entries mean “from 4 up to but not including 7,” “from
7 up to but not including 10,” etc. The lower limits are stated,
and the class is supposed to run up to the lower limit of the class 
that follows. This method of statement, however, is likely to be 
difficult for the novice to interpret. While it may be a useful 
timesaver for the statistician's own private work, it is not so good 
for publication as the other methods which have been mentioned. 
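
Whichever of these notations is used, the rule for placing a value is the same: it belongs to the class whose lower limit it equals or exceeds and whose upper limit it falls short of. A minimal Python sketch of that rule follows; the function name `class_of` and its default arguments (a first lower limit of 4 and an interval of 3, as in Table 3.5) are our own choices for illustration.

```python
import math

def class_of(value, first_lower=4, interval=3):
    """Return (lower, upper) for the class "lower and under upper"
    that contains `value`."""
    k = math.floor((value - first_lower) / interval)
    lower = first_lower + k * interval
    return lower, lower + interval

for x in (4, 6.999, 7, 12.9):
    low, high = class_of(x)
    print(f"{x} goes in the class {low} and under {high}")
```

A value of 6.999 stays in the lowest class, while a value of exactly 7 passes to the next one, just as the text describes.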

The student should develop the habit of inspecting every 
frequency table that he comes across to see if he can determine 
the actual class limits as distinct from those stated in the table. 
This sort of practice will do more than anything else to show the 
advantages of some statements and the disadvantages of others. 
When class limits are properly given, there is no room for doubt 
as to where any particular value should be classified. As one 
further example, suppose we are classifying men according to 
their weights, and two adjacent classes have the following stated 
class intervals: 

173-182 

183-192 

There is no question in this case where we should put a man who 
weighs 182.3 lb. The actual upper limit of the lower class is
182.5, and the item 182.3 should be included there. To be sure,
one could not be certain where he should put a case of exactly 
182.5 lb. Some authorities would favor dividing such a case 
between the classes, giving each of them one-half a case. Even 
easier, if our measurements have been made as accurately as to 
the tenth of a pound, would be to state our class limits thus: 

172.5- 182.4 

182.5- 192.4 

Now it is obvious where the case reported as exactly 182.5 goes. 
By stating our class limits to a decimal accuracy as great as the 
measurements that are to be classified, we are able to eliminate all 
doubt on classification.

3.3. Overlapping Class Limits. —One sometimes sees tables 
published with the upper limit of one class coinciding with the 
lower limit of the next class, thus: 

25-27 

27-29 

29-31 

etc. 

In such cases it is impossible to tell where to classify an item that 
is exactly 27 or exactly 29. It seems to belong in two classes. It 
is confusing to the reader, and bad practice generally, to use such 
overlapping class limits. 
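
A table can be checked for this fault mechanically. The following Python sketch, in which the function name `overlapping_limits` is our own, lists every value that is claimed by two adjacent classes; it assumes the classes are given in ascending order as pairs of stated limits.

```python
def overlapping_limits(classes):
    """Return the values that appear both as the upper limit of one
    class and as the lower limit of the next.

    `classes` is a list of (stated lower, stated upper) pairs,
    assumed to be in ascending order.
    """
    return [upper
            for (_, upper), (next_lower, _) in zip(classes, classes[1:])
            if upper == next_lower]

print(overlapping_limits([(25, 27), (27, 29), (29, 31)]))  # the faulty form
print(overlapping_limits([(4, 6), (7, 9), (10, 12)]))      # Table 3.4
```

The first call reports 27 and 29 as ambiguous; the second reports nothing.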

3.4. Open-end Classes. —Another bad practice often followed 
in making frequency tables is to set up a first class, or a last one, 
or both, in such a way that it is impossible to tell what the class 
limits are. We can illustrate this with Table 3.6, which is in 
many ways an example of how not to make a frequency table. 

Table 3.6.—Ages of Horses on Utah Farms. Hypothetical Data

    Age        Number of
  (years)       Horses
   0- 2            35
   2- 5            78
   5-10           220
  10-20           715
  Over 20          31

In this table there are not only overlapping class intervals, and
class intervals of unequal length, but also it is impossible to tell 
whether the 31 oldest horses were all very close to 20 years old, or 
whether they ran all the way up to 35, 40, or 50. Such a class is 
called an open-end class, and the inclusion of such classes in a 
frequency table materially reduces the value of the table to the 
statistician. If it seems necessary for any reason (such, for 
example, as those stated in Sec. 3.18) to leave an open-end class 
at either end of a table, it would add greatly to the value of the 
table if some further facts were given about the items included 
in the open-end class. In Table 3.6, for example, it would help 
materially if an asterisk were placed beside the figure 31 and a 
statement were then made below the table saying, “The average 
age of these 31 horses was 22.4 years,” or words to that effect. 
When an open-end class is used, one should give either the total 
or the average of the items in the class. 

3.5. Class Intervals.—The difference between the actual lower
limit of any class and the actual lower limit of the next larger 
class is called the class interval. The class interval can also be 
defined as the distance between class marks (see Sec. 3.6). In 
Table 3.5 the class interval is 3 lb., since the lower limit of each 
class falls short by 3 lb. of the lower limit of the next larger class. 
Similarly the class interval of Table 3.4 is 3 lb., although the class 
limits are stated in different form. In Table 3.3 the class interval 
is 50. 

There are decided advantages in setting up the classes of a 
frequency table in such a way that all classes have the same class 
interval. In Table 3.6 no two classes have the same class 
interval. This would make many statistical computations 
unnecessarily difficult, and should be avoided if reasonably 
possible. At a later point in this chapter (see Sec. 3.18), reference 
is made to some kinds of cases where it seems wise to make 
exceptions to the general rule and to use unequal class intervals, 
but unless there is some good reason to the contrary, the rule that 
class intervals should be equal throughout any given frequency 
table is a good rule to follow. 
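
The class intervals of a table can be found by differencing successive lower limits, and the same step shows at once whether the intervals are equal. The sketch below is illustrative only; the function name `class_intervals` is our own, and for Table 3.6 we assume that 20 may be taken as the lower limit of the open-end class.

```python
def class_intervals(lower_limits):
    """Differences between successive lower limits (ascending order)."""
    return [b - a for a, b in zip(lower_limits, lower_limits[1:])]

table_3_5 = class_intervals([4, 7, 10, 13, 16])
table_3_6 = class_intervals([0, 2, 5, 10, 20])

print(table_3_5, "equal" if len(set(table_3_5)) == 1 else "unequal")
print(table_3_6, "equal" if len(set(table_3_6)) == 1 else "unequal")
```

Table 3.5 gives an interval of 3 lb. throughout, while Table 3.6 gives intervals of 2, 3, 5, and 10 years.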

3.6. Class Marks. —For many of the statistical computations 
that we shall describe in the following chapters, it is necessary 
to know the class mark or the class mid-point of each class in a 
frequency table. This is the value midway between the actual
upper and lower limits of the class. In Table 3.5, for example, the
class marks are 5.5, 8.5, 11.5, 14.5, and 17.5 lb. The actual 
limits of the smallest class are 4 and 6.99999 (approaching 7 as a 
limit), and the point halfway between them is found by adding 
them and dividing by 2. In Table 3.4 the class marks are 5, 8, 
11 lb., etc. In this table the actual limits of the smallest class 
are 3.5 and 6.5, and the point halfway between is 5. 
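
The arithmetic for both tables can be set out in a short Python sketch. The function name `class_mark` is our own; for Table 3.5 the actual upper limit of each class is taken as the lower limit of the next class (the value which 6.999... approaches), and for Table 3.4 the actual limits lie half a pound beyond the stated ones.

```python
def class_mark(actual_lower, actual_upper):
    """Mid-point of a class, given its actual limits."""
    return (actual_lower + actual_upper) / 2

# Table 3.5: "4-6.9" means 4 up to, but not including, 7.
marks_3_5 = [class_mark(low, low + 3) for low in (4, 7, 10, 13, 16)]

# Table 3.4: stated limits 4-6 have actual limits 3.5 and 6.5.
stated_3_4 = [(4, 6), (7, 9), (10, 12), (13, 15), (16, 18)]
marks_3_4 = [class_mark(low - 0.5, high + 0.5) for low, high in stated_3_4]

print(marks_3_5)
print(marks_3_4)
```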

Sometimes, frequency tables are given with the classes defined 
by their mid-points rather than by the class limits. For example, 
Table 3.4 could be recast in the form of Table 3.7, and the two 
tables would be understood to mean exactly the same thing. 

Table 3.7.—Hypothetical Example of Breaking Points of 123 Rubber Bands

 Breaking Point     Number of
    (pounds)          Cases
        5               5
        8              23
       11              68
       14              21
       17               6


In this table the reader would understand that all 68 of the 
rubber bands in the central class did not break at exactly 11.0 lb. 
with no others breaking until exactly 3 lb. more had been added. 
He would decide that the values given in the left-hand column 
are class marks. If he needed the class limits he would realize 
that, just as the class marks are halfway between the actual 
limits, so the actual class limits are exactly halfway between the 
class marks if the latter are equally spaced. 
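
The step from class marks back to actual class limits can be sketched as follows. The function name `limits_from_marks` is our own, and the sketch deliberately refuses to guess when the class marks are not equally spaced, since in that case the limits cannot be recovered in this simple way.

```python
def limits_from_marks(marks):
    """Recover actual class limits from equally spaced class marks."""
    interval = marks[1] - marks[0]
    if any(b - a != interval for a, b in zip(marks, marks[1:])):
        raise ValueError("class marks are not equally spaced")
    half = interval / 2
    return [(m - half, m + half) for m in marks]

# Class marks of Table 3.7.
for mark, limits in zip([5, 8, 11, 14, 17], limits_from_marks([5, 8, 11, 14, 17])):
    print(mark, "->", limits)
```

The limits obtained, 3.5 to 6.5, 6.5 to 9.5, and so on, are the actual limits of Table 3.4.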

It will be seen immediately from Table 3.7 that the class 
interval can be determined from the class marks just as easily as 
from the class limits if the class interval is constant throughout 
the table. Where there are open-end classes, however, or where 
the class intervals are unequal, the problem is not so simple. 
Yet for most purposes, the class marks are the important things, 
and we can struggle along with unequal class intervals if the class 
marks are known.

3.7. Cumulative Frequency Tables.—Instead of describing the
numbers of rubber bands that broke within certain ranges of 
weight, we might equally well have listed the numbers that broke 
at or below given weights, or those which broke at or above 
given weights. If we go back to Table 3.5 we see at once that 
only 5 bands broke at weights below 7 lb. Twenty-eight bands 
broke at weights below 10 lb. (since we would have to include 
the 5 that broke below 7 lb. and the 23 that broke between 7 and 
9.9 lb.). Likewise 96 broke at weights below 13 lb., 117 at
weights below 16 lb., and 123 at weights below 19 lb. These 
figures can be derived directly from those of Table 3.5 (as the 
student should verify for himself), and we could state them in the 
form of Table 3.8. 

Table 3.8.—Hypothetical Example Showing Numbers of Rubber
Bands with Breaking Points below Stated Amounts

 Breaking Point     Number of Bands Which Broke at
    (pounds)        Weights below Those Stated at Left
        4                        0
        7                        5
       10                       28
       13                       96
       16                      117
       19                      123


Such a table is called a cumulative frequency table. Another 
form of cumulative frequency table could be made up showing 
the number of rubber bands with breaking points more than the 
stated amounts. For example, we could transform the data of 
Table 3.5 or 3.8 into the form shown in Table 3.9.

Sometimes, as in Table 3.8, our cumulative frequency table 
lists the numbers of cases smaller than given amounts. In 
such cases the table starts at zero and the numbers get larger 
and larger until they equal the total number of items studied. 
Thus Table 3.8 starts at zero and rises to 123, since there were 
123 rubber bands in the hypothetical example. At other times, 
as in Table 3.9, the cumulative-frequency table lists the numbers 
of items larger than given amounts. In such cases the table
starts with the total number of items studied and the numbers
get smaller and smaller until they reach zero. The former
(as in Table 3.8) are called less-than frequencies, while the latter
(as in Table 3.9) are called more-than frequencies. 
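
Both kinds of cumulative frequency follow from the ordinary frequencies by running addition. The Python sketch below works from the figures of Table 3.5; the variable names are our own, and 19 is added to the list of lower limits only to close the last class.

```python
from itertools import accumulate

lower_limits = [4, 7, 10, 13, 16]    # Table 3.5
frequencies = [5, 23, 68, 21, 6]
total = sum(frequencies)

points = lower_limits + [19]         # 19 closes the last class
less_than = [0] + list(accumulate(frequencies))
more_than = [total - c for c in less_than]

for p, lt, mt in zip(points, less_than, more_than):
    print(f"{p:>2} lb.: {lt:>3} below, {mt:>3} at or above")
```

The less-than column reproduces Table 3.8 and the more-than column reproduces Table 3.9; at every point the two figures add to 123.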

Table 3.9.—Hypothetical Example Showing Numbers of Rubber
Bands with Breaking Points above Stated Amounts

 Breaking Point     Number of Bands with Breaking Point
    (pounds)        Equal to or above That Stated at the Left
        4                      123
        7                      118
       10                       95
       13                       27
       16                        6
       19                        0

3.8. Graphic Presentation: the Histogram.—As was pointed 
out in Sec. 3.1, the impression received by inspecting large 
numbers of individual figures is not sharp and clear-cut. In 
order to get a quick impression of the approximate sizes of the 
items, the statistician usually classifies them in a frequency 
table. The figures in Table 3.2 or 3.3 give one a very much 
quicker and more accurate idea of the marks that were received 
by these students than can be obtained from all the distracting 
detail of the original figures which are given on page 26. But 
perhaps the fastest way of all to get a general impression of the 
characteristics of a mass of statistical material is to present them 
in pictorial form, by means of graphs. 

When dealing with frequency distributions, one of the simplest 
of the graphical methods of presentation is the histogram. This 
is made by laying out a horizontal scale, representing the sizes 
of the items (that is, the students’ marks, or the breaking points 
of the rubber bands, etc.), and erecting thereon bars of various 
lengths, the lengths of the bars showing the numbers of cases. 
The data of Table 3.3, for example, are shown in a histogram 
in Fig. 3.1. It will be noted that the frequencies of the five 
classes of the table are now represented by five bars. The base 
line is marked off to represent the marks received, and since the
class limits in the table run from 0 to 250 our scale runs through
the same values. The chart enables us to see at a glance that 
the commonest marks were those between 100 and 150, that there 
were very few marks below 50 or above 200, etc. 
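
A rough histogram can be produced without any drawing instruments at all. The Python sketch below prints one bar per class of Table 3.3, the length of each bar being equal to the frequency. It is only an approximation to Fig. 3.1: the function name `text_histogram` is our own, and the bars run horizontally rather than standing on a horizontal scale.

```python
classes = ["0- 49", "50- 99", "100-149", "150-199", "200-249"]
frequencies = [2, 27, 39, 20, 2]     # Table 3.3

def text_histogram(labels, counts, symbol="#"):
    """One line per class, with a bar whose length equals the frequency."""
    width = max(len(label) for label in labels)
    return [f"{label:>{width}} | {symbol * n} {n}"
            for label, n in zip(labels, counts)]

print("\n".join(text_histogram(classes, frequencies)))
```

Because all five classes have the same class interval, bar length alone gives a fair picture; with unequal intervals this would no longer be true.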

If the class intervals in our original table had been unequal 
in size, it would have been much more difficult to make our 
histogram. Here again we note the value of equality in class 
intervals. When it is necessary to depict data from a table in
which there are unequal intervals, however, adjustments can
and should be made as explained in Sec. 3.19.

Fig. 3.1. Frequency histogram of data of Table 3.3.

Fig. 3.2. Frequency polygon of data of Table 3.3.

3.9. Graphic Presentation: the Frequency Polygon.—As an 
alternative to the histogram, the data of Table 3.3 could be 
represented by a line connecting the mid-points of the tops of 
the bars of Fig. 3.1. In such a case we would locate the class 
marks on the horizontal scale at the base, and over each class 
mark we would locate a point corresponding with the frequency 
within the class. These points would then be connected by 
straight lines, as in Fig. 3.2. 

It will be noted in Fig. 3.2 that our frequency distribution 
is represented by a line that starts on the base line at a point
one-half class interval below the mid-point of the smallest class.
It then passes through a series of points, each one vertically above 
one of the class marks on the scale at the base. Each of these 
points lies at a distance above the base scale proportional to 
the frequency in the class in question. The line finally falls 
back to the base line one-half class interval above the largest 
class mark. 

In the case of the frequency polygon, as such a figure is called,
as in the case of the histogram, the problem becomes more complex
when class intervals are unequal, and in such cases it is
necessary to make adjustments similar to those described in 
Sec. 3.19. 

The frequency polygon is perhaps more likely to be misleading 
than is the histogram, since the uninitiated is apt to attempt to 
read frequencies from the line at points between class marks.
It must be remembered that both the histogram and the frequency
polygon are based on a frequency table that gave the
numbers of cases within various classes, but showed us nothing 
about how they were distributed within those classes. The bars 
of the histogram obviously show the facts for classes as a whole, 
but the unwary reader is likely to select some particular point
on the base line of the frequency polygon and try to read the 
corresponding frequency from the line. For example, in Fig. 3.2 
he is likely to interpret the diagram to mean that 14 people 
received marks of 50, since the line seems to have a height of 
about 14 above the point that represents a mark of 50 on the 
scale at the base. Yet a glance at Table 3.1 will show that 
really not a single student received a mark of 50. The frequency 
polygon must be interpreted as showing the numbers of cases 
within classes, and not the numbers of cases at particular points.
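
The misleading reading can be reproduced by simple straight-line interpolation. In the Python sketch below the polygon of Table 3.3 is described by its corner points; we assume class marks of 25, 75, 125, 175, and 225, with the line meeting the base one-half class interval beyond the first and last class marks. The function name `height_on_polygon` is our own.

```python
# Corner points of the frequency polygon for Table 3.3 (assumed class
# marks 25, 75, ...; the line meets the base at 0 and at 250).
xs = [0, 25, 75, 125, 175, 225, 250]
ys = [0, 2, 27, 39, 20, 2, 0]

def height_on_polygon(x):
    """Height of the straight-line polygon above the mark `x`."""
    corners = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(corners, corners[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the polygon")

print(height_on_polygon(50))
```

The line stands at a height of 14.5 above the mark of 50, which is the source of the false impression; the figure is a product of the drawing, not a count of students.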

3.10. Graphic Presentation: the Frequency Curve.—If we 
could make our class intervals smaller and smaller, the bars in 
Fig. 3.1 would become narrower and narrower. Likewise the 
numbers of cases in the classes would become smaller and smaller 
(see Fig. 3.6). But if we could study larger and larger numbers 
of cases—not 90 students, but 900, or a million, or a limitless 
number—we could still make the bars narrower and narrower 
without making them disappear altogether. In such a case the 
line connecting the tops of the bars in Fig. 3.2 would probably 
come closer and closer to a smooth curve. The scientist assumes
that the values in a frequency distribution are not just chance
affairs, but that they are distributed according to some law. 
The smooth curve which we should get if we could study enough 
cases is called a frequency curve. We often try by one means or
another to estimate what these frequency curves must be like. 
A large part of Chaps. VII and VIII is devoted to a study of 
certain types of frequency curves and the methods of describing 
them. We can always make a frequency polygon or a histogram 
from a frequency table, showing how the items were actually 
distributed; but we can never do more than estimate the nature
of the frequency curve that underlies the data. When we do
estimate such a curve by any of the means described hereafter,
we usually draw it as a smooth curve on our diagram, thus
distinguishing it from a frequency polygon, which is drawn with
straight lines with breaks at the class marks. Figure 3.3 shows
at the left a frequency polygon of the data of Table 5.1, page 83,
while the right half of the figure shows a histogram of the same
data. Superimposed on the histogram is a frequency curve
computed by methods described in Chap. VII. While the
histogram shows the way that the heights were actually distributed
in the cases studied, the frequency curve represents, on
the basis of certain assumptions, the underlying law of the
distribution of men’s heights.

Fig. 3.3. Frequency polygon, frequency histogram, and frequency curve,
all based on the data of Table 5.1, page 83.

3.11. Graphic Presentation: the Ogive. —Just as the frequency 
polygon represents an ordinary frequency table, so we could draw
a chart that showed the data of a cumulative frequency table such
as Table 3.8 or 3.9. At the left of Fig. 3.4, the data of Table
3.8 have been so depicted, while at the right of the same figure
are shown the data of Table 3.9. Charts of this kind, representing
cumulative frequency distributions, are called ogives.
Often the vertical scale is drawn to represent percentages of
the total number of cases, running from 0 to 100 per cent. Such
an arrangement makes it easier to compare two ogives based on
different numbers of cases. 
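
Converting cumulative frequencies to percentages is a matter of dividing each by the total. A minimal Python sketch, using the less-than frequencies of Table 3.8, follows; the function name `to_percent` and the rounding to one decimal place are our own choices.

```python
def to_percent(cumulative, total):
    """Express cumulative frequencies as percentages of the total."""
    return [round(100 * c / total, 1) for c in cumulative]

less_than = [0, 5, 28, 96, 117, 123]     # Table 3.8
print(to_percent(less_than, total=123))
```

The resulting scale runs from 0 to 100 per cent whatever the number of cases, which is what makes two ogives comparable.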




Fig. 3.4. Two ogives. The left-hand section is a “less-than” ogive, and 
the right-hand section is a “more-than” ogive.

3.12. What to Look For in a Frequency Table.—As was pointed 
out in Sec. 3.10, the scientist assumes that every frequency distribution
tends to follow some design or pattern. Different
plants or animals or observations of physical phenomena are not 
all exactly alike, but neither do they differ planlessly. Until 
one has learned to think in terms of frequency distributions, he 
has not really become a scientist. Frequency tables, or their 
graphic counterparts, are basic to an understanding of scientific 
work in general, and particularly to that aspect of it which we 
study in statistics. 

The trained statistician gets from a frequency table a good 
summary picture of the distribution on which it is based. He 
notes the approximate maximum and minimum sizes of items 
which are included and the points, if any, of heaviest concentration.
If the curve rises toward a high point somewhere toward
the center, he notices this fact and the approximate position
of the peak of the curve.

As an illustration we can look back again at Table 3.3, or
at either Fig. 3.1 or 3.2, both of which are based on that table. 
The statistician looking at this diagram would note at once 
that the marks tend to pile up in the center, somewhere near the 
score of 125, and that marks above 200 or below 50 are very 
unusual. If the class interval had been smaller and the number 
of classes correspondingly larger, it is likely that even more 
information could be derived at a glance. 

3.13. Common Shapes of Frequency Curves.—While a histogram
or frequency polygon might assume almost any shape, long
experience with varied kinds of data has shown that most distributions
tend to fall into one or another of a relatively small
number of classes. It is consequently assumed that most frequency
curves assume a reasonably small number of shapes.

By far the largest proportion of frequency distributions seem 
to be mound-shaped or humpbacked, with small numbers of cases
near the extremes and larger numbers of cases near the center. 
With certain kinds of data, there seems to be good reason for 
anticipating that the values would be arranged in some such 
pattern, as we shall see in Chap. VII; but even where there is no 
a priori ground for expecting it, we find over and over again that 
distributions of radically different kinds of data from distant 
branches of science assume this mound-shaped form. 

Sometimes the mound-shaped distribution is symmetrical, 
with the right-hand side of the curve presenting a mirror image 
of the left-hand side. In other cases, even though there is a 
high point in the curve somewhere between the two extremes, the 
curve is asymmetrical, or skewed. A symmetrical frequency 
curve is shown at the left in Fig. 3.5 and an asymmetrical curve at 
the right. We shall have occasion to study symmetry and lack 
of symmetry at a later stage (see Chap. VIII). 

But it would be a mistake to assume that frequency distributions
are always mound-shaped. Sometimes a frequency distribution
starts with a high point at the left end and falls lower
and lower as one moves toward the right. Possibly a curve
might start at a low point on the left and run higher and higher
until its highest point was at the extreme right. Such a distribution
is called a J-shaped distribution as distinct from the
mound-shaped distributions which are much more common.

Suppose you were to investigate all the women in the United
States between the ages of 15 and 20, and you were to find how
many had never been married, how many had been married 
once, how many twice, how many three times, etc. Your data 
would form a frequency distribution, and the chances are pretty 
good that it would be J-shaped, starting very high on the left 
and falling lower and lower. Or imagine that you could classify 
the men of the United States according to the numbers of warts 
on their noses. Presumably you would again find a J-shaped 
distribution, with the largest class being those with no nasal
warts, the next-largest class being those with one such disfigurement,
and with the number of men falling as the number of
warts rose. These illustrations should help the student to
understand that there is nothing unnatural about J-shaped
distributions—that mound-shaped distributions are not “correct”
or “proper.” With certain kinds of data one seems in
practice to find mound-shaped distributions, but with other
sorts of data it is just as natural to find other patterns.

Fig. 3.5. Symmetrical frequency curve at the left, and skewed curve at the
right.

Once in a long time one finds a distribution that yields a curve 
with a low spot in the middle and high spots at both ends. Such 
a distribution is called a U-shaped distribution. It has been
shown that the percentage of cloudiness at certain weather 
stations seems to follow the U-shaped distribution; that is, 
there are many days when the sky is completely obscured by 
clouds, and many days when there are no clouds at all. But as 
one comes closer and closer to the point where half the sky is 
clouded and half clear, he finds fewer and fewer days to use as 
illustrations. It has also been suggested that marks in certain
difficult and advanced college courses tend to run in U-shaped
distributions, on the theory that the only people who elect the 
courses are those who are either very good at the subject or too 
ignorant even to understand their own lack of ability. U-shaped 
distributions are so uncommon in practice, however, that statisticians
look anxiously for them to use as illustrations, and the
ordinary student does not need to expect to encounter them. 

3.14. Common Shapes of Ogives.—Since an ogive is a graph of
a cumulative frequency curve, it is evident that there will be a
definite relationship between the shape of the frequency curve
and the shape of the ogive based on the same data. An ogive, as
we have seen, either starts at the bottom and works to a maximum
or starts at the maximum and works down to zero (see
Sec. 3.11). But in order for an ogive to be a straight line, it
would be necessary for the frequencies of all classes in the frequency
table to be equal, since in that case each time we added
a new class we would add the same frequency, and our line would
rise always at the same rate. This uncommon sort of frequency
distribution is called a rectangular frequency distribution, and it
would be represented by a frequency table in which each class
had the same frequency; or by a histogram in which all bars were
the same length; or by a frequency polygon which was a straight,
horizontal line; or by an ogive which was a straight line, rising
or falling according to whether we have less-than or more-than
frequencies (see Sec. 3.7).

But the commonest kind of frequency curve, as we saw in 
the preceding section, is the mound-shaped curve. In such a 
distribution, the first few classes are small, getting larger and 
larger for a time, reaching a maximum ultimately, after which 
they get smaller and smaller. If we are to add these classes to 
form an ogive, it is evident that our line will start out low (if we 
have less-than frequencies), and at first we shall add only small 
increments to it. But as we pass to the larger classes, the frequencies
become greater, so each time we add a little more than
the time before. For this reason, our ogive rises more and more 
steeply until we reach the point corresponding to the peak of the 
frequency polygon or the largest frequency in the frequency 
table. Thereafter we keep on adding classes, and consequently 
our ogive continues to rise—but each added class is smaller than 
the one before it, and therefore we add less and less each time.
For this reason our ogive rises more and more gradually, finally
tending to flatten out and approach the horizontal as it nears the
top. We thus see that the ogive which corresponds to a mound-shaped
curve is (if we use less-than frequencies) a rising line
in an S-shape (see Fig. 5.2, page 97). If we use more-than
frequencies, a similar line of reasoning will show that we get a
falling line with a reverse S-shape. In fact, it is because mound-shaped
distributions are most common, and because their ogives
have this characteristic S-shape, that these cumulative-frequency 
curves are called ogives. The student will recall that the so-called 
ogee curve of architecture or in furniture is an S-shaped curve, 
and the ogive gets its name from its common S-shape. 
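
The reasoning of this section can be checked with a few lines of Python. The sketch below is only an illustration: the class frequencies are hypothetical (they are not taken from any table in this book) and were chosen merely to be mound-shaped. It forms the less-than and more-than cumulative frequencies and shows that each rise of the less-than ogive equals the frequency of the class just added.

```python
from itertools import accumulate

# Hypothetical mound-shaped class frequencies (not from the book's tables).
frequencies = [2, 5, 11, 18, 11, 5, 2]

# Less-than ogive: running total of the frequencies, class by class.
less_than = list(accumulate(frequencies))

# More-than ogive: the number of cases in each class and all higher classes.
total = sum(frequencies)
more_than = [total - c + f for c, f in zip(less_than, frequencies)]

# Each rise of the less-than ogive is simply the frequency just added,
# so the rises grow up to the largest class and shrink afterwards.
rises = [b - a for a, b in zip([0] + less_than, less_than)]

print(less_than)
print(more_than)
print(rises)
```

If every frequency in the list were the same, all the rises would be equal and the ogive would be the straight line of the rectangular distribution.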

3.15. Making a Frequency Table: How Many Classes? —It 
is now time to leave our general discussion of the nature of frequency
distributions and pay some attention to the practical
problems encountered in the actual making of frequency tables. 
If you were faced with the problem of making a table from a 
large number of original figures such as those listed on page 26, 
your first problem would be to determine how many classes 
to make. Should you divide the 90 marks into 17 different 
groups, as in Table 3.2, or into 5 different groups, as in Table 3.3, 
or should you decide on some other number? 

It is evident at once that the number of classes in a frequency 
table depends on the size of the class interval. In Table 3.2, 
where the class interval is 10, there are many more classes than 
there are in Table 3.3, where the class interval is 50. In fact, 
the number of classes and the size of the class interval will be 
roughly, though not exactly, in inverse proportion. 1 

1 The student with a mathematical turn of mind will be interested in 
proving for himself that there is one case where one can tell in advance 
something about the relationship between the number of classes and the 
size of the class interval. This is the case where we have made a table with 
some given class interval, and we make a new table with a smaller class 
interval 1/nth as large as the old one, where n is an integer. In such a case, 
if the n new classes are contained wholly within one of the old classes, not 
overlapping at the limits, it should be easy for the student to demonstrate 
that the new table with smaller classes may have as many as n times the 
former number of classes (in which case the class interval and the number of 
classes have varied in exact inverse proportion), or the new number of 
classes may be as much as 2(n - 1) smaller than n times as large as the first
grouping. If the first coarse grouping had m classes, the new grouping may
yield as many as mn or as few as mn - 2(n - 1) classes.

It may seem at first that the number of classes is immaterial.
But some idea of the importance of a correct choice in the matter 
may be obtained from Fig. 3.6, which shows the data of Table 
3.1, page 27, in four different histograms, with class intervals 
of 100, 50, 25, and 10. It will be seen at once that when the 
class interval gets too small the diagram loses that simple regularity
which characterizes the underlying law of the distribution.
We begin to get all sorts of erratic variations in the lengths of
the bars. This is the result, at least in part, of the fact that the
number of cases in each class has become very small and, therefore,
particularly unreliable and subject to chance fluctuation.
Just as you would probably hesitate to estimate the average
weight of newborn giraffes after having seen but two or three
of them, so you would not expect to get much accuracy from a
class that contained but three or four cases.

Fig. 3.6. Data of Table 3.1 plotted with various class intervals.

Inspection of Fig. 3.6 shows, however, that as the class interval 
grows larger, and the number of classes grows smaller, we get 
enough cases in each class so that the erratic variations tend to 
disappear, and the underlying pattern becomes much plainer. 
To be sure, we can go too far in this direction, making so few 
classes that no pattern at all is evident. If we were to take a 
class interval of 300 in this illustration, all our cases would fall
in the first class from 0-299, and we would see nothing of the
nature of the distribution at all. Hence we can say, first, that 
we want the number of classes to be both small enough and large 
enough to show the nature of the distribution. 

In addition, we are to see at later points in this book that the 
statistician often treats all the cases in a given class as though 
they were equal in size, all being equal to the class mark. This 
is a very useful assumption, and will not give us any great error 
if the classes are reasonably narrow. But if the classes get too 
wide, it will patently be unwise to assume that all the cases in a 
given class are even approximately the same size. 

Also if we make our class intervals too small, we lose one of 
the major advantages that we seek from classification in frequency
tables. Suppose, in the case of the student marks, that
we set our class interval as low as one unit. Then we get Table 
3.1, page 27. Here we have almost as many classes as the 
original number of cases. It is the purpose of frequency tables to 
condense and compress our data, to rid them of their minor 
peculiarities, and to present them in summary form so that we 
can grasp them quickly. We could even use a class interval 
smaller than one, say one-tenth. Then our classes would look 
something like this:

81.05-81.14 

81.15-81.24 

81.25-81.34 

etc. 

Then, since our original figures were all whole numbers, 9 out of 
every 10 classes would be vacant. 
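
The point about vacant classes is easy to verify. The following is a minimal sketch under stated assumptions: the marks are hypothetical whole numbers, and all values are scaled to hundredths so that classes one-tenth of a unit wide (80.95-81.04, 81.05-81.14, and so on) can be tallied with integer arithmetic. The function name `tally` is our own.

```python
def tally(values, lower, width, n_classes):
    """Count how many values fall in each of n_classes equal classes."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - lower) // width] += 1
    return counts

# Hypothetical whole-number marks, expressed in hundredths (81 -> 8100).
marks = [81, 82, 82, 83, 84, 84, 84, 85]
scaled = [100 * m for m in marks]

# Fifty classes, each one-tenth of a unit wide, starting at 80.95.
counts = tally(scaled, lower=8095, width=10, n_classes=50)
vacant = sum(1 for c in counts if c == 0)

print(vacant, "of", len(counts), "classes are vacant")
```

Only the classes that contain a whole number can receive any cases, which is why nine-tenths of them stand empty here.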

We shall mention in Sec. 6.9 another suggestion as to the 
number of classes in a frequency table, but for the time being we 
can summarize by saying that the number of classes should be 
large enough (and the class interval small enough) so that all 
items in a class may reasonably be treated as equal without too 
much error, and so that the general pattern of the distribution is 
not obscured by lumping together too large a proportion of the 
items in a very small number of classes. On the other hand, 
the number of classes should be small enough (and the class 
interval large enough) so that our data are compressed into a 
reasonably small number of classes, so that there will be no
vacant classes unless near the extremes of the data, so that the
pattern of the distribution is not obscured by erratic fluctuations 
from class to class, and so that there is a reasonably large number 
of items within each class except possibly near the extremes 
of the data. 

3.16. Making a Frequency Table: Rules of Thumb. —Instead 
of a general discussion of the issues involved, many authors 
have contented themselves with stating arbitrary rules as to the 
numbers of classes to be used in a frequency table. Perhaps 
commonest are statements that the number of classes should 
vary between 12-25 or between 15-20, or some other arbitrary 
limits. 

The student will see at once that no general statement can 
be made which will cover all frequency distributions. For 
example, if we have but 25 cases we evidently cannot use even 
as many as 10 classes and get reasonable numbers of cases within 
most of them. On the contrary, if we have 10,000 cases we may 
well be able to spread them over 50 classes and still get a good, 
smooth curve which shows well the general nature of the underlying
pattern. The number of classes and the number of cases
are directly related.

Sturges has endeavored 1 to give us a definite rule to cover all
cases. He suggests that we let 

m = 1 + 3.3 log N 

where m is the number of classes and N is the number of cases. 
For example, if we have 842 cases, Sturges’ rule tells us that the 
number of classes should be 

m = 1 + 3.3(log 842) ≈ 1 + 3.3(2.92) ≈ 10.6

This rule suggests, then, that we should use 10 or 11 classes. The 
rule is easy to apply, but most statisticians seem to agree that the 
formula suggests too many classes when the number of cases is 
small and too few classes when the number of cases is large. 
Actually, the choice of the number of classes to use will have to
depend mainly on the nature of the data studied, and on the units
in which they are stated, far more than on any arbitrary rule laid
down in advance. Perhaps we can say again that we want to get
the class interval small enough so that all items in a class can be
treated as roughly the same size, but that, subject to this restriction,
the fewer classes we can make and still show the underlying
pattern of the distribution, the better.

1 H. A. Sturges, The Choice of a Class Interval, Journal of the American
Statistical Association, Vol. 21, 1926, pp. 65-66.
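
Sturges' rule is easily tried for several values of N. In the sketch below the function name `sturges_classes` is our own, and the logarithm is taken to base 10, as in the worked example. Because the machine carries more decimal places than the 2.92 used in the text, its result for 842 cases is nearer 10.65 than 10.6, but the conclusion (10 or 11 classes) is unchanged.

```python
import math

def sturges_classes(n_cases):
    """Number of classes suggested by Sturges' rule, m = 1 + 3.3 log N."""
    return 1 + 3.3 * math.log10(n_cases)

# 90 and 842 are the cases discussed in the text; 10,000 is added for contrast.
for n in (90, 842, 10_000):
    print(n, sturges_classes(n))
```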

3.17. Making a Frequency Table: Choosing the Class Interval. 
When, by means of Sturges’ rule or by some other means, we have
decided on the approximate number of classes for our table, the 
next problem is that of selecting a class interval which will 
yield that number of classes. Let us go back, for example, to the 
data with which we opened this chapter, giving marks received 
by 90 students on an examination (see page 26). Sturges’ rule 
would tell us that we should have 7 or 8 classes. Suppose, for 
purposes of illustration, that we accept these figures as correct. 
What class interval should we use to get 7 or 8 classes? The 
answer is easy to determine. Inspection shows that the highest 
mark received by anyone was 206 and the lowest mark was 43. 
The range, or difference between the highest and lowest values 
in the distribution, was 206 - 43 = 163. If we want to divide
these 163 units into 7 classes we get 163/7 = 23+ as our class
interval. If we want to get 8 classes we should use 163/8 = 20+
as our class interval.

It would be very foolish for a statistician to follow any rule
so slavishly that he would set his class interval at exactly 163/8 or
163/7 just because his arithmetic yields that quotient. He will
save time and effort and get results just as good if he uses a class
interval that is reasonably convenient. In the two cases just
illustrated, for example, the statistician would be likely to choose
an interval of 25 where the rule gives 23+, and an interval of
20 where it gives 20+. Class intervals that are in tens or
multiples of tens, or in units or exact decimal values of units, 
are by far the easiest to use. Class intervals of 1, 2, 3, 5, 10, 20, 
25, etc., are the most common. We can therefore state our rule 
for finding the class interval as follows: (1) Find the range 
(difference between the largest and smallest value). (2) Divide 
the range by the number of classes that you have set up as being 
approximately right. (3) Use the quotient as the approximate 
class interval, but round it off to a whole number, and if possible 
to some number easy to work with in classifying the items. 
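
The three steps of this rule may be sketched in Python as follows. Two things here are our own assumptions rather than part of the rule: the particular list of "convenient" intervals, and the choice of the convenient value nearest the quotient. Only the highest and lowest marks (206 and 43) are needed to find the range.

```python
# Assumed list of convenient intervals; the text names 1, 2, 3, 5, 10, 20, 25, etc.
CONVENIENT = (1, 2, 3, 5, 10, 20, 25, 50, 100)

def suggest_interval(values, n_classes, convenient=CONVENIENT):
    """Return (range, quotient, convenient class interval)."""
    data_range = max(values) - min(values)           # step 1
    quotient = data_range / n_classes                # step 2
    # Step 3 (our assumption): take the convenient value nearest the quotient.
    chosen = min(convenient, key=lambda c: abs(c - quotient))
    return data_range, quotient, chosen

extremes = [43, 206]  # lowest and highest marks given in the text
print(suggest_interval(extremes, 7))
print(suggest_interval(extremes, 8))
```

With 7 classes the quotient of 23+ leads to an interval of 25, and with 8 classes the quotient of 20+ leads to an interval of 20, in agreement with the choices made above.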

Where one is dealing with very large numbers of cases, it is
not even necessary to determine the exact range. A rather
hurried inspection will usually show approximately the largest 
and smallest items, and from them an approximate range can be 
computed which is just as good as the exact range, since our 
answer is to give but an approximation to the class interval at 
any rate. 

3.18. When to Use Unequal Class Intervals. —It has been 
pointed out over and over again in this chapter that there is 
advantage in using uniform class intervals throughout a frequency 
table if it can be done reasonably. But now we must pay some
attention to those cases where there is good reason to use unequal 
class intervals. First let us see what sorts of cases there are 
in which unequal intervals may be desirable. The principal 
reasons for equal class intervals are that the frequencies are 
then directly comparable from class to class and that statistical 
computations are greatly facilitated. But even these advantages 
do not in all cases outweigh the advantages of unequal intervals. 

In the first place, we have cases of badly skewed distributions, 
where one end of the curve runs far, far away from the peak. 
For example, a frequency distribution of the incomes received 
by people in the United States would show that most of the 
incomes are bunched rather closely around $500 to $2000. 
Relatively few people receive incomes below $500 and relatively 
few over $2000 per year (see Fig. 5.3). If we want to show how 
these incomes are distributed, we cannot take a class interval 
of $5000, or even of $2000, or we shall lump all these cases 
together in one class and obscure the shape of the distribution 
entirely. Yet suppose we were to decide on a class interval 
of $1000 (which would still be far too large for actual use). 
The largest incomes were several million dollars a year. If we 
are to include enough classes at $1000 per class to reach the 
highest incomes, it would be necessary to make thousands of 
classes. This is out of the question. Hence we make small 
classes where the cases are numerous and larger classes where 
the cases are sparse. If we were to make uniform class intervals, 
and have a reasonably small number of classes, it is clear that we 
would get a J-shaped distribution, since our first class would 
have to contain incomes from zero to perhaps $100,000 or more. 
When a statistician finds a J-shaped distribution, he always 
tests the frequencies at the more populous end by trying smaller
class intervals to see whether it is actually J-shaped or contains
a hidden mound near the end. 

A second reason for using nonuniform class intervals may be 
to get similar cases together. Vital statistics are often classified 
in the following age groups, for example: 


Under 1 year     10-14.9
1-1.9            15-19.9
2-2.9            20-29.9
3-3.9            30-39.9
4-4.9            40-49.9
5-9.9            50-59.9
                 etc.


In such a classification, the very young children are put in 
small groups largely because it is thought that the problems of 
children under the age of 1 year differ enough from the problems 
of children between one and two years so that it will be advantageous
to classify them separately, while the problems of people
of the age of 50 and those of people aged 59 may be roughly 
similar from the standpoint of the vital statistician. The 
scientist would be foolish to lump together cases which should 
be treated separately merely in order to retain uniformity of 
class intervals. 

A third reason for using nonuniform class intervals in some 
tables is to keep data confidential. Uniform class intervals are 
likely to bring small frequencies in classes near the extremes. 
Where there are only one or two cases in an extreme class, it 
may be easy for informed people to figure out who is who, 
discovering what income this firm gets, or what are the costs 
of that firm, etc. Many figures, especially those collected by 
the government, are obtained on the basis of promises that they 
will be held in confidence, and uniformity of class intervals may 
make this retention of confidence impossible. 

For any of these reasons, or, perhaps, for others, 1 it may be 
decided that the frequency table should be made up with unequal 
class intervals even though there are many disadvantages of 
such a course. In such a case, certain precautions should be 
taken to make sure that the results are not misleading. 

1 See, for example, Sec. 8.12.

3.19. How to Use Unequal Class Intervals.—Let us turn our
attention to Table 3.11, in which there are unequal class intervals.
The dangers inherent in the use of such grouping are immediately 
apparent. As we glance through the table, we get the impression 
that the heaviest concentration of cases falls in the class 70-89. 
It appears that the frequencies get larger and larger until we 
reach a peak at 40-49, after which we have a slight fall, rising 
again to an even higher peak in the class 70-89, after which the 
frequencies fall again. 

Table 3.11.—Illustrating the Use of Unequal Class Intervals

Class limits    Number of cases
    0-  9              5
   10- 19             22
   20- 29             35
   30- 39             39
   40- 49             41
   50- 59             39
   60- 69             35
   70- 89             48
   90-109             28
  110-129             16
  130-169             12

Yet when we look at the table carefully, we notice that the 
class interval is twice as great in the 70-89 class as it is in any 
of the preceding classes. If the frequencies were concentrated 
just as heavily in this class as in the preceding one, there should 
be twice as many cases, since the class interval is twice as great. 
Yet there are not twice as many cases. The preceding class 
has 35 cases as compared with 48 cases here. 

If we are to make the cases comparable, it should be evident 
that we must divide each frequency by its class interval. This 
would give us a new table such as Table 3.12. 
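
The adjustment can be carried out mechanically. The sketch below divides each frequency of Table 3.11 by its class interval; it assumes that the class limits are whole numbers, so that a class such as 70-89 covers 89 - 70 + 1 = 20 units. The function name `frequency_per_unit` is our own.

```python
# Table 3.11 as (lower limit, upper limit, number of cases).
table_3_11 = [
    (0, 9, 5), (10, 19, 22), (20, 29, 35), (30, 39, 39),
    (40, 49, 41), (50, 59, 39), (60, 69, 35), (70, 89, 48),
    (90, 109, 28), (110, 129, 16), (130, 169, 12),
]

def frequency_per_unit(table):
    """Divide each frequency by its class interval.

    Assumption: limits are whole numbers, so the interval of a class
    is upper - lower + 1.
    """
    adjusted = []
    for lower, upper, cases in table:
        interval = upper - lower + 1
        adjusted.append((lower, upper, cases / interval))
    return adjusted

adjusted = frequency_per_unit(table_3_11)
for lower, upper, density in adjusted:
    print(f"{lower:>3}-{upper:<3} {density:.1f}")

# The class with the most cases is not the class of heaviest concentration.
raw_peak = max(table_3_11, key=lambda row: row[2])[:2]
adjusted_peak = max(adjusted, key=lambda row: row[2])[:2]
print(raw_peak, adjusted_peak)
```

Before adjustment the largest frequency lies in the class 70-89; after adjustment the heaviest concentration per unit of class interval lies in the class 40-49.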

Now we see that the frequencies build up smoothly from each 
end toward the middle and that there is really one high point 
rather than two. Perhaps this can be visualized even better 
from Fig. 3.7. In the upper part of this figure, we see a histogram 
of the data of Table 3.11 made without any adjustment for 
inequality of class intervals, and therefore giving the incorrect
impression. The lower part of the figure shows the data correctly
plotted from Table 3.12, and in this case one gets the
correct impression at once. In making this correct histogram,
each bar covers a width on the base line corresponding to its
class interval, and the area of the bar (width times height)
is proportional to the frequency actually found in the class.
In order to get this proportionality, the heights of the bars are
not proportional to the original frequencies, but proportional to
the adjusted frequencies of Table 3.12.

Fig. 3.7. Adjustment of histogram for unequal class intervals. Same data
with and without adjustment.

Since our class intervals in Tables 3.11 and 3.12 are 10, 20, 
and 40, it is immaterial whether we make our adjustments by 
dividing by 10, 20, and 40 or by dividing by 1, 2, and 4. Either 
will put our results in the same proportions. It is perhaps easier 
to state the rule for adjusting frequencies where there are unequal
class intervals by saying that one divides each frequency by the
corresponding class interval, but as long as the frequencies are 
divided by numbers proportional to the class intervals the proper 
results will be obtained. 

Table 3.12.—Illustrating Adjusted Unequal Class Intervals

Class limits    Frequency per unit of class interval
    0-  9              0.5
   10- 19              2.2
   20- 29              3.5
   30- 39              3.9
   40- 49              4.1
   50- 59              3.9
   60- 69              3.5
   70- 89              2.4
   90-109              1.4
  110-129              0.8
  130-169              0.3


3.20. Logarithmic Frequency Classes.—When a frequency histogram 
looks like the lower part of Fig. 3.7, some statisticians suggest that there 
may be real advantage on technical and theoretical grounds in using unequal 
class intervals of a particular kind. These are class intervals so arranged 
that the successive lower class limits will be in constant proportion, rather 
than differing by constant amounts. It is easy to show that in such cases 
the class intervals of successive classes will also form a geometric progression 
instead of being equal. 

If we have a frequency distribution in which the values range in size 
from a low of 15 to a high of 200, and we wish to distribute them in 10 logarithmic
intervals, we fall back on the familiar formula for geometric
progressions:

L = S(1 + r)^n

where L is the largest value, S the smallest value, and n the number of 
logarithmic classes. Solving by logarithms we get a value of (1 + r), which 
is the amount by which the lower limit of each class must be multiplied to 
find the lower limit of the next larger class. For example, in the distribution 
mentioned above, L is 200, S is 15, and n is 10.

200 = 15(1 + r)^10

This yields a value of (1 + r) = 1.296. Starting with the lower limit of 
the lowest class at 15, and multiplying each lower limit by 1.296, we find 
the lower limits of succeeding classes, and set up the following unequal 
logarithmic class intervals:

 15.0- 19.4
 19.5- 25.1
 25.2- 32.6
 32.7- 42.3
 42.4- 54.8
 54.9- 71.0
 71.1- 92.0
 92.1-119.3
119.4-154.6
154.7-200.5


It will be noted that the class intervals are unequal, starting with an interval
of about 4.5 and ending with one of about 45. Each interval is 1.296 times 
as great as the preceding one. 
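
The solution for (1 + r) and the resulting lower limits may be checked by a short computation. This is a sketch only; the function name `logarithmic_lower_limits` is our own. The machine does not round the ratio, so its final limit is exactly 200, whereas the list above, built with the rounded multiplier 1.296, drifts up to about 200.5.

```python
def logarithmic_lower_limits(smallest, largest, n_classes):
    """Solve L = S(1 + r)^n for (1 + r), then list the successive limits."""
    ratio = (largest / smallest) ** (1 / n_classes)
    limits = [smallest * ratio ** k for k in range(n_classes + 1)]
    return ratio, limits

ratio, limits = logarithmic_lower_limits(15, 200, 10)
print(round(ratio, 3))
for lower, next_lower in zip(limits, limits[1:]):
    print(f"{lower:6.1f} up to {next_lower:6.1f}")
```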

By their nature, logarithmic frequency classes cannot start at zero. Some 
statisticians suggest that if a distribution yields a more nearly symmetrical 
frequency polygon in logarithmic classes than in equal class intervals, we 
should use the logarithmic classes and also use the geometric rather than 
the arithmetic mean (see Chap. V). Logarithmic frequency classes of 
another kind can be fitted by a method described by George R. Davies in
the Journal of the American Statistical Association, Vol. 20, p. 467. The
beginning student will do well, however, to confine himself to equal class 
intervals unless he has some compelling reason to do otherwise. 

3.21. Making a Frequency Table: Locating the Class Marks.— 

Having decided upon our class interval in accordance with the 
suggestions of Sec. 3.17, we still have to decide where to locate 
the class marks, or, what amounts to the same thing, where to 
locate the class limits. Suppose we have decided to use a class 
interval of 10, and the smallest value in our distribution is 68. 
Shall we set up our first two classes thus: 

60-69
70-79 
etc. 


or shall we use the following: 

62-71 

72-81 

etc. 

or might we decide on such a peculiar arrangement as 

64.38-74.37

74.38-84.37
etc. 

In each of these cases, the class interval is 10, and it is obvious 
that there are limitless other combinations of class limits that 
could be used, still retaining the class interval at this size. The 
decision as to the size of the class interval has not completed
our problem of setting up our frequency table, since we still must
locate the class limits. 

Unless there is some good reason to the contrary, we usually 
take class limits that are whole numbers, such as those given in 
the first two of the three illustrations of the preceding paragraph. 
And if the class interval is 5, 10, 25, 50, or 100, or some such 
number, we commonly make each lower limit an exact multiple 
of the class interval, as in the first of the three illustrations in 
the preceding paragraph. Some writers have suggested that the 
class marks, rather than the lower limits, be made whole numbers 
and, if possible, multiples of 10. Their argument is that such 
an arrangement will save time in later computation when, as we 
shall see, the assumption is made that all values in a class are 
equal in size to the class mark. But when the table is such that 
computations can be made by the short method rather than by 
the long method (as explained in Chap. V), there is no advantage
in having integral class marks, and classification of items is
speeded up usually by having integral class limits. 

Sometimes the data being tabulated run down to zero and then 
stop, negative values being impossible. For example, we might 
consider a frequency distribution showing the number of corporations
hiring various numbers of employees. No concern
would hire a negative number of men, so that our values may 
not run below zero. In such a case, if the values actually do 
run down to or close to zero, it is evident that we cannot maintain 
uniform class intervals at the lower end of the table with some 
locations of class marks, while we can with others. Suppose 
again that we are using a class interval of 10. If we have classes 
of 33-42 and 23-32, etc., our low classes will have to be 3-12 
and 0-2. This makes the lowest class interval smaller than the
others. Sometimes, then, the location of the class marks would 
be determined by our wish to keep even our smallest class uniform 
with the others. 
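
The common practice of making each lower limit an exact multiple of the class interval can be sketched as follows. The sketch assumes whole-number, non-negative data, and the function name `class_limits` is our own. Because the first lower limit is a multiple of the interval, a distribution that runs down to zero gets a lowest class of full width.

```python
def class_limits(smallest, largest, interval):
    """List (lower limit, upper limit, class mark) for equal classes.

    Assumptions: whole-number data, and lower limits placed at exact
    multiples of the class interval.
    """
    lower = (smallest // interval) * interval
    classes = []
    while lower <= largest:
        upper = lower + interval - 1
        mark = (lower + upper) / 2   # midpoint of the stated limits
        classes.append((lower, upper, mark))
        lower += interval
    return classes

# Smallest value 68 and interval 10, as in the text; 95 is a hypothetical maximum.
for row in class_limits(68, 95, 10):
    print(row)

# Values running close to zero: the lowest class is 0-9, not a stub such as 0-2.
print(class_limits(2, 25, 10)[0])
```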

A third consideration in the locating of class marks becomes 
prominent in those distributions where certain values are common 
and other values either do not appear at all or appear uncommonly.
For example, it may be that tickets to a ball game are
sold at 25 cents, 50 cents, and $1, but that no other values
occur. No ticket will be purchased at 38 cents or at any intermediate
value. Yet again, we might be listing the numbers of
rooms in houses, in which case we could get 5-room houses or
6-room houses, but there would be no houses with 5.4 rooms. 
Distributions of this character, where only certain values are 
possible, are called discrete distributions. We can contrast them 
with continuous distributions, in which any intermediate value
can occur. For example, men's heights do not necessarily fall 
at 70 or 71 in. or any other particular value. It is quite possible 
for a man's height to be 70.342 in., or any other conceivable 
value within the whole range from the shortest to the tallest 
person. Most distributions with which the statistician deals 
are either continuous, or the breaks are so small compared with 
the range of the data that they can be safely treated as continuous.
As an example of the latter, if we were classifying
incomes received by people in the United States one might argue 
that the distribution is discrete, since one can receive $1,043.21 
or $1,043.22, but not between. Yet 1-cent breaks are so small 
when compared with the vast range of incomes that there is little 
error in saying that the distribution is continuous. 

Some people have described distributions as homograde where 
we have called them discrete and as heterograde where we have 
called them continuous. But unfortunately there is no uniformity
of usage for these terms, since other authorities would
say that a homograde series is one in which there are only two 
possibilities—a characteristic is either present or it is absent. 
Thus a division of people into those who are vaccinated and those 
who are unvaccinated would, by this definition, constitute a 
homograde series, while a heterograde series would be one that 
showed variation in magnitude. Most statisticians would distinguish
these last two cases in other terms by saying that when
we study those things which are either present or absent we are
studying attributes, while when we study things in which the
magnitude can assume many different values we are studying 
variables. There is no confusion if we speak of continuous and 
discrete data, nor if we speak of attributes and variables. There 
is, however, difference in the usage of the words homograde and 
heterograde. 

With discrete data, it is natural that there should be bunching 
of values at particular points, since no intermediate values can 
occur. But sometimes we get a similar bunching even in cases 
where intermediate values could occur. For example, when
people are asked their ages they usually give whole numbers of
years, leaving off intermediate fractions; and census data also 
show that there is a very real tendency for people to give their 
ages in multiples of 5 or 10, stating that they are 40 or 45 years 
old even though they may really be 41 or 44. Estimated values 
are particularly likely to show this bunching. If we ask people 
to estimate distances or ages or weights, they are apt to do it in 
whole numbers and in multiples of 5 or 10. A farmer may say 
that he grows 30 acres of wheat or 35 acres, but he is very unlikely 
to state that he grows 31.398 acres even though that may be the 
fact. Since many statistical computations are based on numbers 
which were originally estimates it is worth while to keep this 
fact in mind. 

Regardless of the reason for bunched values, whether because 
the data are discrete, or because they are estimated, or because 
they are given only approximately, or for any other reason, data 
which are characterized by points of marked concentration 
should be tabulated with the points of bunching at the class 
marks. This is because we shall later assume that all items 
in the class are at the class mark, and the error will be minimized 
if the bunched cases are actually located there. 


3.22. Summary: Directions for Making a Frequency Table.—We can 
now summarize the actual steps involved in making a frequency table as 
follows: 

1. Decide whether or not to use equal class intervals. Use equal intervals 
if reasonably possible. See Sec. 3.18 for suggestions concerning the use of 
unequal intervals. The remaining steps summarized below assume the 
use of equal class intervals. 

2. Decide how many classes to use. For suggestions see Secs. 3.15 and 
3.16. 

3. Find the range or the approximate range between the largest and the 
smallest values in the distribution. 

4. Divide the range found in step 3 by the number of classes found in 
step 2. Use the quotient as a first approximation to the class interval. 

5. Select a class interval that is convenient—usually a whole number 
and possibly a multiple of 5 or 10—using the approximation of step 4 as 
a basis. 

6. Decide where to locate the class limits. For suggestions see Sec. 3.21. 

a. Lower class limits should usually be whole numbers, often multiples 
of 5 or 10. 

b. Make sure that the lowest class can be included without altering the 
class interval.

c. If there is bunching for any reason (as in discrete series or where values
are based on estimates) put the popular values at class marks. 

7. Having laid out the class limits, distribute the original values among 
the classes, noting how many items fall in each class. 
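
For the student who wishes to see the steps in operation, they may be strung together as in the sketch below. It is not offered as a definitive procedure. It assumes equal class intervals and whole-number, non-negative data; it uses Sturges' rule for step 2 although the text treats that rule only as a rough guide; and it takes the convenient interval nearest the quotient, from a list of our own choosing. The data are hypothetical.

```python
import math

def frequency_table(values, convenient=(1, 2, 3, 5, 10, 20, 25, 50, 100)):
    """Build a table of (lower limit, upper limit, frequency) rows."""
    n_classes = round(1 + 3.3 * math.log10(len(values)))          # step 2
    data_range = max(values) - min(values)                        # step 3
    quotient = data_range / n_classes                             # step 4
    interval = min(convenient, key=lambda c: abs(c - quotient))   # step 5
    lower = (min(values) // interval) * interval                  # step 6
    table = []
    while lower <= max(values):
        upper = lower + interval - 1
        count = sum(1 for v in values if lower <= v <= upper)     # step 7
        table.append((lower, upper, count))
        lower += interval
    return table

# Hypothetical whole-number data.
data = [12, 15, 21, 22, 27, 30, 31, 33, 38, 44, 47, 52]
for lower, upper, count in frequency_table(data):
    print(f"{lower}-{upper}: {count}")
```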

3.23. Suggestions for Further Reading.—It is impossible to cover adequately
a good deal of important material on frequency distributions in one
chapter. In Chaps. VII and VIII we shall discuss again certain particular 
forms of frequency distributions which are of especial statistical importance. 
Karl Pearson suggested methods of treating some of the commoner frequency 
curves. His original memoirs on the subject may be found in Philosophical 
Transactions, A, at the following three points: Vol. 186, pp. 343ff.; Vol. 197,
pp. 443ff.; and Vol. 216, pp. 429ff. Perhaps even better for the general
student than these scattered references would be the more compact treatment
found in the first six chapters of W. Palin Elderton, “Frequency Curves
and Correlation,” C. & E. Layton, London, 1927. An even more condensed
summary, with directions and illustrative examples but little discussion
of underlying theory, may be found in C. B. Davenport and M. P. Ekas,
“Statistical Methods in Biology, Medicine, and Psychology,” John Wiley
& Sons, Inc., New York, 1936. Chapter 7 of the “Handbook of Mathematical
Statistics,” edited by H. L. Rietz, Houghton Mifflin Company,
Boston, 1924, gives a valuable discussion of frequency curves including both 
Pearson’s forms and others. 


EXERCISES 

1. Suppose that you want to divide data into classes with uniform class 
intervals of five units. You wish to have the value 5 and its multiples at the 
class marks. List some of the classes as they would appear in a frequency 
table. 

2. Give several examples of data which are discrete and several examples 
of continuous data. 

3. Classify the data given on page 26 into a frequency table. Make the 
class interval 25, and have the value 25 and its multiples at the midpoints 
of the classes. 

4. Go to the library and record the number of pages in each of the first 
100 books that you find. Classify the results in a frequency table, following 
the rules of Sec. 3.22. 

5. Select from this or some other book three or four pages of solid reading
matter, not broken up by illustrations, tables, or formulas. Count the 
number of words in each line, and make a frequency table showing the 
numbers of times that various numbers of words appear. It will be best to 
select only complete lines, omitting those which begin and end paragraphs 
if they are shorter than ordinary lines. Make your own rules on the treat¬ 
ment of abbreviations, hyphenated words at the ends of lines, etc. Continue 
until you have counted 150 to 200 lines. 

6. The “World Almanac” has, for several years, published a table giving 
facts about “Noted Americans of the Past.” The years of birth and of 
death are given for each such noted person. If we find the approximate 
age at which each of these people died by subtracting the year of birth from 
the year of death, we get the following figures (taken from the 1941 edition,
page 660, and being data on the first 204 persons listed in alphabetical 
order): 


33  77  76  87  79  80  80  81  75  75  66  89
56  83  74  71  65  52  76  76  50  69  73  49
94  87  70  86  69  71  85  78  83  72  71  46
53  65  66  81  90  91  78  58  81  91  84  48
81  70  68  74  88  75  48  74  66  73  77  77
76  65  72  63  54  85  74  65  48  63  47  69
59  73  46  67  77  58  60  99  59  72  73  65
84  76  41  77  50  84  75  81  72  68  45  62
84  95  59  84  86  66  62  65  60  85  66  68
62  75  60  75  59  73  72  73  59  56  62  92
65  39  30  64  76  50  78  83  82  68  37  78
82  73  92  67  81  52  71  41  94  79  78  76
56  71  70  81  48  78  93  25  71  67  34  78
62  77  67  76  89  84  55  65  92  86  79  86
83  71  69  71  63  73  45  71  83  36  77  59
44  55  37  38  57  77  80  81  40  50  64  88
64  74  85  58  71  65  77  81  84  71  70  51


Make a frequency table of these figures, following the rules of Sec. 3.22. 

7. In any table the class marks always fall exactly halfway between the
actual class limits. Under what circumstances will the actual class limits 
fail to fall exactly halfway between the class marks? 

8. The families of 898 working-class men in Bolton, England, were 
classified in 1924 according to the number of rooms occupied by the family 
with the following results: 1 


Number of rooms. 

2 

3 

4 

5 l 

r ”"i 

6 

7 

8 

Total 

Number of families. 

15 

477 

227 , 

169 

7 

1 

2 

1 

898 


It will be noted that this table is a frequency table shown horizontally 
as contrasted with our usual vertical arrangement. Show the data in a 
histogram. 

9. Show the data of the preceding exercise in a frequency polygon.

10. Make a “less-than” ogive of the data of Exercise 8.

11. Make a cumulative frequency table from the data of Exercise 8.

12. Solve Sturges’ rule for the data of Exercise 6.

13. Divide the data of Exercise 6 into four classes with logarithmic
frequency classes (see Sec. 3.20).

14. Suppose we are given the data of Table 3.13. Note that the class
intervals are not equal. Make a histogram of the data, making proper
adjustment for the inequality of class intervals (see Sec. 3.19).

1 Figures quoted in R. G. D. Allen, “Mathematical Analysis for Economists,”
p. 411, The Macmillan Company, New York, 1939.







Table 3.13.—Frequency Table with Unequal Class Intervals

Size of Items    Number of Cases
150-159           15
160-169           60
170-179           85
180-189           98
190-199          105
200-209          104
210-219           97
220-229           83
230-239           62
240-259           88
260-279           56
280-309           45
310-339           15


15. Make a “more-than” ogive of the data in Table 3.13. Decide in
advance whether or not it is necessary to make any correction for the 
inequality in class intervals. 



CHAPTER IV 


MEASURES OF CENTRAL TENDENCY 

4.1. Averages. —We have seen that the statistician commonly 
groups masses of data together into frequency tables so that they 
will be easier to comprehend. But often he wishes to go even 
further, to compute some one number which will in some definite 
way represent all the numbers of the group. Any number that, 
in this way, is used to represent a whole series of values is called 
an average of those values. To be sure, the word “average” is
used in common speech to mean one particular kind of representa¬ 
tive figure—a representative figure computed in a particular way. 
But technically there are many kinds of averages, and sometimes 
the statistician uses one and sometimes another. These various 
representative values or type values or averages are computed in 
various ways, and they represent the group in various ways. It 
is the purpose of this chapter to investigate some of the more 
commonly used averages and to ascertain their characteristics. 

Although there is no limit to the number of ways in which one 
could select a value as representative of the group, there are in 
practice only a few ways in which statisticians find it worth while 
to attack the problem. We shall confine our study to the 
methods that are in most common use among statisticians. In 
this chapter, we shall consider the ways of finding representative 
values when each of our original figures is given individually; 
in the following chapter, we shall study the same problem as it 
is handled when the data have been grouped together in frequency 
tables. And at the end of the next chapter, we shall study the 
use and interpretation of the results. 

4.2. The Arithmetic Mean: Ungrouped Data.—The arithmetic
mean 1 is the measure most people have in mind when they use the
word “average.” The concept is familiar to every student and
needs no discussion here. The arithmetic mean of a series of
values is the quotient obtained by dividing the sum of the values
by the number of values. We can symbolize this computation as
follows:

$$\bar{X} = \frac{\Sigma X}{N}$$

where each of the original figures is represented by X.

N = the number of cases.

Σ means “the sum of.”

$\bar{X}$ represents the mean of the X’s.

Thus the formula should be read, “The mean of the X’s is the
sum of the X’s divided by the number of cases.”

1 This measure is called indiscriminately the “arithmetic mean,” the
“arithmetic average,” or merely the “mean.”

Let us illustrate. We have five numbers (N = 5), as follows: 

7; 4; 6; 3; 10 


If we add them (ΣX) we get 30. Thus, since ΣX = 30 and
N = 5, our formula becomes

$$\bar{X} = \frac{\Sigma X}{N} = \frac{30}{5} = 6$$



It is well to become accustomed to statistical symbols in a case 
such as this, where the student knows in advance what is expected 
of him. Every student knows how to find the average of these 
five numbers without taking a course in statistics and without 
having a Greek-letter formula to guide him. But this is a good 
opportunity for him to discover that statistical formulas are but 
shorthand directions for computations. If one understands the 
symbols, one knows that ΣX/N says, “Add up the X’s and divide
the sum by the number of cases.” And since the symbols always
mean the same thing, when they have once been mastered it is
easy to follow their directions. It would pay to learn them as
they come. So far we have these:

X always refers to the figures with which you start. (If you
start with two series of figures, one may be called X and one Y,
or one may be called $X_1$ and the other $X_2$, etc.)

Σ (the Greek capital letter sigma) means “the sum of the things
which follow.” It is the sign for addition.

N always means “the number of cases.” 

If we were to find the average of the 90 examination marks
given on page 26, we should use the same formula; that is, we
should divide the sum of the marks by 90, the number of marks.
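
The formula ΣX/N translates almost word for word into code. The following is a minimal Python sketch using the five numbers of the illustration above; the function name `arithmetic_mean` is an assumption made for this example.

```python
def arithmetic_mean(xs):
    """Sum of the values divided by the number of cases (sigma X over N)."""
    n = len(xs)
    if n == 0:
        raise ValueError("at least one value is required")
    return sum(xs) / n


values = [7, 4, 6, 3, 10]          # the five numbers used in the text
mean = arithmetic_mean(values)     # 30 / 5
print(mean)
```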

4.3. Weighted Arithmetic Mean. —Sometimes we wish to find 
the average of several numbers which are not of equal impor¬ 
tance. In such a case it is necessary for us to add one slight 
complication to our method. The addition can best be explained 
in terms of an illustration. Suppose that there are, in a given 
high school, 100 freshmen, 80 sophomores, 70 juniors, and 50 
seniors. On a given day 15 per cent of the freshmen are absent, 
5 per cent of the sophomores, 10 per cent of the juniors, and 2 
per cent of the seniors. What percentage is absent for the school 
as a whole? The student is likely to attempt to find the answer 
by adding the four percentages and dividing by 4. This would 
give him the following incorrect answer: 

$$\frac{15 + 5 + 10 + 2}{4} = \frac{32}{4} = 8$$


We can quickly find, however, that 8 per cent is not the correct 
answer. There must have been 15 freshmen absent (15 per cent 
of 100), 4 sophomores absent (5 per cent of 80), 7 juniors absent
(10 per cent of 70), and 1 senior absent (2 per cent of 50). This 
makes 27 students absent altogether out of a student body of 300. 
Our correct answer, then, is 9 per cent rather than 8 per cent. 

In such a case, we commonly find the correct average by a 
process known as weighting. We determine how important each 
of our original numbers is and assign it a weight proportionate 
to its importance. We then multiply each number by its weight 
and add the products. The sum of the products is then divided 
by the sum of the weights. If we add one new symbol to those 
already listed in Sec. 4.2, we can represent the weight assigned 
to any number by the letter W. Then our formula for a weighted 
arithmetic mean would be 


$$\bar{X} = \frac{\Sigma(XW)}{\Sigma W}$$

We see at once that this formula gives the following directions:

1. Multiply each original value (X) by the corresponding weight (W).

2. Add the products thus obtained.

3. Divide this sum by the sum of the weights.

To set up our hypothetical example in formal manner, so that
we may see how the formula works, we list our original percentages
(15, 5, 10, and 2 per cent) as the values of X. We
assign to each a weight (W) equal to its importance. In this
case the weights are the numbers of students on which each
percentage was based, on the grounds that a percentage based
on 100 cases deserves more weight than one based on only 10
or 15 cases. Our problem appears as follows:


X          W       XW
15        100     1500
5          80      400
10         70      700
2          50      100
Totals    300     2700

$$\bar{X} = \frac{\Sigma(XW)}{\Sigma W} = \frac{2700}{300} = 9$$

This time we get the correct answer, 9 per cent, at once. 

The student should note that we do not weight an average 
merely because we are fond of statistical computation, nor 
because we wish to impress the layman, but because weighting 
gives the right answer. 
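
The three directions for the weighted mean can be sketched in Python as follows. The function name `weighted_mean` is an assumption made for this example; the figures are the absence percentages and class sizes of the high-school illustration.

```python
def weighted_mean(xs, ws):
    """Sum of the products X*W divided by the sum of the weights W."""
    if len(xs) != len(ws):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(ws)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(xs, ws)) / total_weight


percent_absent = [15, 5, 10, 2]    # X: per cent absent in each class
enrollment = [100, 80, 70, 50]     # W: number of students in each class

weighted = weighted_mean(percent_absent, enrollment)       # 2700 / 300
unweighted = sum(percent_absent) / len(percent_absent)     # 32 / 4
print(weighted, unweighted)
```

Setting every weight to 1 makes `weighted_mean` return the ordinary arithmetic mean, since the numerator becomes ΣX and the denominator becomes N.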

Let us take one further illustration. In 1950, the populations 
per square mile of the New England states were approximately 
as follows: 


State              Population per Square Mile
Maine                       29.5
New Hampshire               59.2
Vermont                     40.6
Massachusetts              594.0
Rhode Island               719.5
Connecticut                409.3


What was the average density of population in New England? 
If we add the six numbers and divide their sum by 6, we shall 
get an incorrect answer, 308.7 persons per square mile. Again 
it is necessary, if we want the right answer, to weight the average.
When we stop to think about it, we realize that the large figure
given for Rhode Island is based on very few square miles, while 
the small figure for Maine is based on many more square miles. 
We should not give the original figures equal importance, but 
should weight them according to the number of square miles in 
each state. We can then work out the correct population density 
for New England, as follows: 


State            Population per    Area (thousands       (XW)
                 Square Mile (X)   of square miles) (W)
Maine                 29.5              31.0              914.5
New Hampshire         59.2               9.0              532.8
Vermont               40.6               9.3              377.6
Massachusetts        594.0               7.9             4692.6
Rhode Island         719.5               1.1              791.4
Connecticut          409.3               4.9             2005.6
Totals                                  63.2             9314.5


The average density of population per square mile can now be 
found by the formula for the weighted arithmetic mean, as 
follows: 


$$\bar{X} = \frac{\Sigma(XW)}{\Sigma W} = \frac{9{,}314.5}{63.2} = 147.4$$


The average density of population per square mile in New 
England was 147.4. 

Let us note why this weighted answer is the correct one. If 
we multiply the population per square mile of any state by the 
number of square miles in the state, we shall get the total popula¬ 
tion of the state. If we then add up these products for each of 
the states, we shall get the total population of the district. And 
if we divide this total population by the total area, we shall get 
the population per square mile in the total area. This is exactly 
what we did in the example above with the single exception that 
we carried out our computations in thousands of square miles 
instead of single square miles in order to save time. We could 
well have rounded off our computations even further in accord¬ 
ance with the rules of Chap. II. 
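
The reasoning of the last paragraph (density times area gives population; total population over total area gives the overall density) can be checked directly. This is a sketch in Python using the figures of the table above; the variable names are assumptions made for the example.

```python
# Figures from the New England table: X is persons per square mile,
# W is area in thousands of square miles.
densities = [29.5, 59.2, 40.6, 594.0, 719.5, 409.3]
areas = [31.0, 9.0, 9.3, 7.9, 1.1, 4.9]

# Density times area gives each state's population (in thousands).
populations = [d * a for d, a in zip(densities, areas)]

total_population = sum(populations)    # thousands of persons
total_area = sum(areas)                # thousands of square miles

weighted_density = total_population / total_area
unweighted_density = sum(densities) / len(densities)

print(round(weighted_density, 1))      # the weighted figure of the text
print(round(unweighted_density, 1))    # the incorrect simple average
```

Because both numerator and denominator are in thousands, the unit cancels and the quotient is in persons per square mile.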

Strictly speaking, every arithmetic average is weighted. If
we add several numbers and divide by the number of them, we
have merely weighted them all equally, that is, given each a weight
of 1. We can see that if each value is given a weight of 1 our
formula for the weighted mean is reduced to the ordinary formula
for the “unweighted” arithmetic mean.

Whenever we are taking the average of several percentages, 
averages, ratios (note that population per square mile is a ratio), 
or any other numbers which for any reason differ in their impor¬ 
tance, we must weight the average if we are to get the right 
answer. Sometimes the weights are entirely arbitrary, as when 
a teacher decides to weight the final examination in a course 
twice as heavily as a regular examination given during term time 
or to weight laboratory work half again as heavily as recitations. 
One observation of a solar eclipse may be weighted more heavily 
than another because of better visibility, more accurate instru¬ 
ments, more experienced observers, or for any other reason. We 
shall note in the next chapter one other common type of case in 
which weighting is necessary. 

4.4. The Median: Ungrouped Data. —The median is the value 
so chosen that there are just as many cases larger in value than 
the median as there are cases smaller in value than the median. 
In other words, if we arrange all the values in order of size, with 
the smallest item on one end and the largest item on the other, 
and if then we select a value in such a way that there will be the 
same number of items on each side of it, the value so selected 
is the median. 

It is easier to illustrate this concept than to describe it. If we 
take the five values which we used in illustrating the arithmetic 
mean, we recall that they were 

7; 4; 6; 3; 10 

First we arrange them in order of magnitude. When we have a 
series of values arranged in order of size, we say that we have an 
array. If we arrange these five items in an array we have

3; 4; 6; 7; 10 

Now let us select such a number that there will be just as many 
values above as below. If we select the value 4, we find but one 
value smaller and three values larger: this will not meet our 
requirement. If we select the value 5, we find two values smaller
and three larger: again we have not met the requirement. Obvi¬ 
ously the only value that will suit us is 6: there are two values 
below 6 and two values above it. 

Let us illustrate again with the data on page 26. First we 
must arrange them in an array. This gives us the following: 

206; 203; 191; 181; 179; 176; 168; 166; 165; 164; 

163; 159; 158; 158; 158; 157; 156; 154; 153; 152; 

151; 150; 149; 147; 147; 143; 142; 142; 139; 138; 

136; 136; 133; 133; 131; 128; 123; 122; 121; 119; 

118; 114; 114; 113; 112; 112; 112; 109; 109; 107; 

107; 107; 106; 105; 104; 104; 103; 102; 102; 101; 

101; 95; 94; 93; 92; 90; 89; 89; 88; 85; 85; 85; 

84; 82; 82; 81; 81; 81; 81; 79; 76; 75; 75; 73; 

69; 65; 57; 55; 49; 43. 

Now that the items are arranged in an array, we must select a 
value that will divide the distribution into two parts with the 
same number of items in each part. We might start out at ran¬ 
dom, taking items that looked likely and seeing how many were 
larger and how many smaller. We might, for example, start with 
the value 110 and count the items which exceed 110 and those 
which are smaller. Trial will show that there are 47 items which 
are larger than 110 and 43 items which are smaller: it is obvious 
that we must select a value somewhat larger than 110. We 
could continue to try items in this manner until we discovered 
a value which met the requirement. Such a method, however, 
would be very wasteful of time. It is obvious, to begin with, 
that we want the item that lies at the center of the distribution. 
Suppose we arrange three items and want one that will divide 
the values evenly: we must obviously choose the second item. If 
we have four items we must select a point between the second 
and third. If we have five, as in the first illustrative example 
above, we know that we must choose the third. Experiment 
will show that, if we are to select the item that will divide the 
distribution into two equal parts, we must select the item that 
is number (N + 1)/2 in the array. Thus, if there are 5 items we must
select item number (5 + 1)/2, which equals 3. In this case we have 90
items, and (90 + 1)/2 = 45.5; that is, we must select a value
which is halfway between the values of the 45th and the 46th
items. If we count in our array we discover that the 45th item
from the bottom has a value of 112 and the 46th also has a value
of 112. The median will be halfway between them, or, since
they are identical in size, will be (112 + 112)/2 = 112.

When the number of items in the array is even, the median is 
taken as the arithmetic average of the two central items. When 
the number of items is odd, the median is the value of the item 
which is item number (N + 1)/2 from either end. Note that
we do not find the median by evaluating the expression (N + 1)/2.
This formula merely tells us the position of the median. If there
are 571 items, and we wish to find the median, we arrange them
in array. The median is not

$$\frac{N + 1}{2} = \frac{571 + 1}{2} = 286$$

but the median is the value of the 286th item in the array. In
other words, the median is found by first arranging the items and 
then counting them, finally taking the value of the item which is 
central, or, if there is no single central item, the average of the 
two central items. 

Note, then, that, if the median mark given on an examination 
was 112, we mean that as many students received more than 112 
as received less than 112. If we say that the median height of 
100 men was 5.5 ft., we mean that as many men were taller than 
5.5 ft. as were shorter than 5.5 ft. This might not be at all true 
of the arithmetic mean, as you will observe from the following 
example. The mean of the items 

4; 5; 6; 7; 203 

is 225/5 = 45. But 45 is exceeded by only one of the items; four
items are smaller. The median of these five items would be the
value of the (N + 1)/2 item, or the value of the (5 + 1)/2 item,
that is, the third item. This item has a value of 6: there are
two items above it and two items below. Here it will be noted 
that the median and the arithmetic mean do not necessarily have 
the same value. 
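
The procedure of this section (arrange the items in an array, locate position (N + 1)/2, and average the two central items when N is even) can be sketched in Python. The function name `median` is an assumption made for this example; Python's standard `statistics.median` follows the same rule.

```python
def median(xs):
    """Value of the item at position (N + 1)/2 of the array.

    When N is even there is no single central item, so the two
    central items are averaged.
    """
    array = sorted(xs)                 # arrange the items in an array
    n = len(array)
    if n == 0:
        raise ValueError("at least one value is required")
    position = (n + 1) / 2             # 1-based position, not the median itself
    if n % 2 == 1:
        return array[int(position) - 1]
    lower = array[n // 2 - 1]          # item just below position (N + 1)/2
    upper = array[n // 2]              # item just above it
    return (lower + upper) / 2


first = [7, 4, 6, 3, 10]
skewed = [4, 5, 6, 7, 203]

print(median(first))                        # third item of 3, 4, 6, 7, 10
print(median(skewed))                       # third item again
print(sum(skewed) / len(skewed))            # arithmetic mean, pulled up by 203
```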

4.5. The Mode: Ungrouped Data. —The mode is the value that 
occurs most frequently. The modal income of wage-earners in 
the United States is the most common income, the income which 
is received by more people than receive any other income. If 
we say that the modal size of farm in a given community is 78 
acres, we mean that there are more farms of this size than of any
other size. 

In any statistical problem with continuous data and with fine 
enough measurements, it is probable that no two values will 
exactly coincide. Hence there will be no one value occurring 
more often than any other value. It may be that no two men in 
the United States are of exactly the same height if we could meas¬ 
ure them with enough exactitude. How, then, could there be a 
modal height? In such a case we should group the data and 
compute the estimated mode from the groups of a frequency 
table. Or it may be that the crudity of our measurements will 
be such that the data are already grouped. Thus if we can meas¬ 
ure heights only to the nearest inch, so that all men between 
5 ft. 8.5 in. and 5 ft. 9.5 in. are recorded as being 5 ft. 9 in. tall, 
then we have patently grouped together many men whose heights 
are really slightly different. In this way we may find that many 
men seem to have the same height, and we get the same results 
that we obtain by conscious grouping of the cases. 

Table 4.1.—Frequency of Appearance of Various Numbers of Black
Cards in 102 Deals of 10 Playing Cards

Number of Black Cards    Frequency
 0                          0
 1                          1
 2                          5
 3                         12
 4                         18
 5                         34
 6                         22
 7                          7
 8                          2
 9                          0
10                          1


When we have discrete data, on the other hand, the mode may 
be easier to ascertain. Let us illustrate such a case. Table 4.1 
shows the number of times that various numbers of black cards 
appeared in 102 deals of 10 playing cards. Here the data are 
patently discrete, since we may get 4 black cards or 5 black cards, 
but never 4.26 black cards. And here the mode is also plainly 
marked. The commonest number of black cards—the number 
which appeared more often than any other single number—was
5. One may often find distributions in which there is no mode, 
that is, in which no single value appears more often than others. 
In yet other cases there may be two or more modes or points of 
concentration. In these respects the mode differs from the 
median and the arithmetic mean, since there is always one median 
or arithmetic mean and never more than one. 
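
For discrete data already tallied, as in Table 4.1, finding the mode amounts to looking for the largest frequency. The sketch below, in Python, returns every value that shares the largest frequency, so that a distribution with two or more modes is reported as such. The function name `modes` and the second, two-mode table are assumptions made for illustration.

```python
def modes(frequency_table):
    """Return every value whose frequency equals the largest frequency."""
    highest = max(frequency_table.values())
    return [value for value, f in frequency_table.items() if f == highest]


# Table 4.1: number of black cards -> number of deals in which it appeared.
black_cards = {0: 0, 1: 1, 2: 5, 3: 12, 4: 18, 5: 34,
               6: 22, 7: 7, 8: 2, 9: 0, 10: 1}

print(modes(black_cards))

# Hypothetical table with two points of concentration.
print(modes({1: 3, 2: 3, 3: 1}))
```

If every value appears equally often the function returns all of them, which corresponds to the case described above in which no single value appears more often than the others.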

4.6. The Geometric Mean: Ungrouped Data. —The arithmetic 
mean of a group of values was found by adding them and dividing 
the sum by their number. The geometric mean is computed by 
multiplying the values together and taking the nth root. Thus 
the geometric mean of the numbers 7, 9, and 11 is 

$$\sqrt[3]{7 \times 9 \times 11} = \sqrt[3]{693} = 8.849$$

This method of computation is useful if we are to average but two 
or three numbers, but if we are asked to average 12 or 50 or 200 
numbers we discover that the process involves the extraction of 
the 12th or 50th or 200th root. This is out of the question. We 
can arrive at the same result, however, by another method. The 
student will recall that adding the logarithms of numbers is 
equivalent to multiplying the numbers themselves together, and 
that dividing a logarithm by n is equivalent to extracting the nth 
root of the number. We can, therefore, work our problem by 
adding the logarithms of the n numbers, dividing the result by n, 
and taking the antilogarithm of the quotient. For example, 
suppose we are required to find the geometric mean of the numbers
12, 17, 33, 21, and 162. The long process would involve the
multiplication of the five numbers and the extraction of the fifth 
root. The short method involves the addition of the logarithms 
of the five numbers, the division of the sum by 5, and the taking 
of the antilogarithm. The process follows: 


X         log X
12       1.07918
17       1.23045
33       1.51851
21       1.32222
162      2.20952
Σ(log X) = 7.35988

$$\frac{\Sigma(\log X)}{N} = \frac{7.35988}{5} = 1.47198$$

Geometric mean = antilog 1.47198 = 29.65

This process of discovering the geometric mean can be symbolized
by the formula

$$\log M_g = \frac{\Sigma(\log X)}{N}$$

where $M_g$ represents the geometric mean and the other symbols
have the meanings already attached to them.

In both illustrative problems of this section, we have found 
geometric means which are smaller than the arithmetic means 
of the same numbers. Experiment will show that unless all 
the numbers being averaged are identical in size the geometric 
mean of a group of numbers is always smaller than their arith¬ 
metic mean. And, of course, if a single one of the original 
numbers is zero, their geometric mean is also zero. 
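
The logarithmic short method can be sketched in Python as follows: add the common logarithms, divide by N, and take the antilogarithm. The function name `geometric_mean` and the decision to reject negative values are assumptions made for this example; the text itself considers only positive numbers and the zero case.

```python
import math


def geometric_mean(xs):
    """Antilogarithm of the arithmetic mean of the common logarithms."""
    if not xs:
        raise ValueError("at least one value is required")
    if any(x < 0 for x in xs):
        raise ValueError("this sketch handles only non-negative values")
    if any(x == 0 for x in xs):
        return 0.0                     # a single zero makes the product zero
    mean_log = sum(math.log10(x) for x in xs) / len(xs)
    return 10 ** mean_log              # the antilogarithm


print(round(geometric_mean([7, 9, 11]), 3))
print(round(geometric_mean([12, 17, 33, 21, 162]), 2))
```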

4.7. The Harmonic Mean: Ungrouped Data. —The harmonic 
mean of a group of numbers is the reciprocal of the arithmetic 
mean of their reciprocals. Thus, if we wish to find the har¬ 
monic mean of seven numbers, we first take their reciprocals. 
We then find the arithmetic mean of these reciprocals and take 
the reciprocal of the result. (The reciprocal of any quantity is 
the quotient that results when unity is divided by that quantity.) 
In the preceding section we found the geometric mean of the 
numbers 7, 9, and 11. Their harmonic mean would be found 
as follows: 


$$M_h = \frac{1}{\dfrac{\frac{1}{7} + \frac{1}{9} + \frac{1}{11}}{3}} = \frac{1}{\dfrac{0.142857 + 0.111111 + 0.090909}{3}} = \frac{1}{\dfrac{0.344877}{3}} = \frac{1}{0.114959} = 8.7$$

The values of the reciprocals are, of course, discovered from 
tables of reciprocals. It is not necessary to compute them each 
time. 

We note that the harmonic mean (which we can symbolize as 
$M_h$) of the numbers 7, 9, and 11 is 8.7; the geometric mean we
have found to be 8.849, and the arithmetic mean is 9. If we take 
the second example which we used with the geometric mean, and 
compute the harmonic mean of the numbers 12, 17, 33, 21, and 
162, we find the following: 

$$M_h = \frac{1}{\dfrac{\frac{1}{12} + \frac{1}{17} + \frac{1}{33} + \frac{1}{21} + \frac{1}{162}}{5}} = \frac{1}{\dfrac{0.2262516}{5}} = \frac{1}{0.0452503} = 22.1$$


If we again compare the results obtained by the three methods, 
we find 

Arithmetic mean = 49.0 
Geometric mean = 29.65 
Harmonic mean = 22.1 


Experiment will show that whenever we average a group of values 
the arithmetic mean will be larger than the geometric mean, and 
the latter will be larger than the harmonic mean (unless all the 
values averaged are of the same size, in which case the three 
averages will be identical). 1 

It is somewhat easier to compute the harmonic mean by a 
method other than that so far used. We have seen that the 
harmonic mean is based on the arithmetic mean of the reciprocals 
of numbers, and it was to show this that we used the method here¬ 
tofore presented. But note that 

$$\frac{1}{\dfrac{\Sigma(1/X)}{N}} = \frac{N}{\Sigma(1/X)}$$

so that in practice we divide the number of items by the sum of 
their reciprocals. To compute the harmonic mean of our last 
example again by the shorter method, we have 

$$M_h = \frac{N}{\Sigma(1/X)} = \frac{5}{0.2262516} = 22.1$$

It is impossible to compute the harmonic mean of any set of
numbers if one or more of these numbers is zero, since division
by zero is not allowed in mathematics. Where only a few numbers
are involved it is possible to compute their harmonic mean
quickly by means of alternative formulas. If we have but two
values, a and b, their harmonic mean is 2ab/(a + b). If we have
three values, a, b, and c, their harmonic mean is 3abc/(ab +
ac + bc). The student can easily derive these formulas for
himself.

1 For proof of the fact that these inequalities will persist except when the
items averaged are identical in size, see Davis and Nelson, “Elements of
Statistics,” pp. 96ff., Principia Press, Bloomington, Ind., 1935.
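
The shorter method, N divided by the sum of the reciprocals, is sketched below in Python, together with a check of the two-value and three-value formulas. The function name `harmonic_mean` and the pair 7 and 9 used for the two-value check are assumptions made for this example.

```python
import math


def harmonic_mean(xs):
    """Number of items divided by the sum of their reciprocals."""
    if not xs:
        raise ValueError("at least one value is required")
    if any(x == 0 for x in xs):
        raise ValueError("the harmonic mean is undefined when a value is zero")
    return len(xs) / sum(1 / x for x in xs)


print(round(harmonic_mean([7, 9, 11]), 1))
print(round(harmonic_mean([12, 17, 33, 21, 162]), 1))

# The alternative formulas for two and three values.
a, b, c = 7, 9, 11
two_value = 2 * a * b / (a + b)
three_value = 3 * a * b * c / (a * b + a * c + b * c)
print(math.isclose(two_value, harmonic_mean([a, b])))
print(math.isclose(three_value, harmonic_mean([a, b, c])))
```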

Not only is it true that the value of the geometric mean of any set of
numbers always lies between their arithmetic and their harmonic means,
but in the special case where we are dealing with two numbers we can show
that the geometric mean of the two numbers is also the geometric mean of
their arithmetic and harmonic means. Suppose we let the two numbers
be represented by x and y. Then their arithmetic mean is (x + y)/2, their
geometric mean is $\sqrt{xy}$, and their harmonic mean is

$$\frac{2}{\dfrac{1}{x} + \dfrac{1}{y}} = \frac{2}{\dfrac{x + y}{xy}} = \frac{2xy}{x + y}$$

Using these formulas, we notice that the geometric mean of the arithmetic
and the harmonic means is

$$\sqrt{\frac{x + y}{2} \cdot \frac{2xy}{x + y}} = \sqrt{xy}$$

But we have just seen that this is the geometric mean of the two original
numbers; so it is evident that for this particular case (where there are but
two numbers involved) the geometric mean of the two numbers is identical
with the geometric mean of their arithmetic and harmonic means.
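
A numerical check of this two-number identity may be reassuring. The following Python sketch uses the pair 4 and 9, which is a hypothetical choice made only because the square root comes out evenly.

```python
import math

x, y = 4.0, 9.0                        # hypothetical pair of numbers

arithmetic = (x + y) / 2
geometric = math.sqrt(x * y)
harmonic = 2 * x * y / (x + y)

# Geometric mean of the arithmetic and harmonic means.
mean_of_means = math.sqrt(arithmetic * harmonic)

print(arithmetic, geometric, round(harmonic, 3))
print(math.isclose(mean_of_means, geometric))
```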



Perhaps the relationship between the sizes of the arithmetic mean and the
geometric mean of two numbers can be most easily visualized by means of
the diagram in Fig. 4.1. Here we have a semicircle, with its diameter cut
into two sections a and b by the perpendicular m. By elementary geometry
we can demonstrate that m is the geometric mean of a and b, while their
arithmetic mean is the radius of the circle. When the perpendicular is raised
at the center of the diameter, a = b = m; and we have the limiting case in
which the two original values are equal and the arithmetic mean equals the
geometric mean. But whenever the perpendicular is erected at any point
other than at the center of the diameter, the perpendicular will be shorter
than the radius, and the geometric mean will be smaller than the arithmetic
mean.


4.8. The Quadratic Mean: Ungrouped Data.—The quadratic 
mean of a group of numbers is found by squaring the numbers, 
finding the arithmetic average of the squares, and taking the 
square root of the result. We can illustrate again with the three 
numbers 7, 9, and 11. The squares of the numbers are 49, 81, 
and 121. The sum of these squares is 251, and their arithmetic 
average is 83.66. The square root of 83.66 is 9.15. If we take 
the other set of numbers with which we have illustrated our 
earlier averages, 12, 17, 33, 21, and 162, we proceed to find their 
quadratic mean as follows: 


X          X²
12          144
17          289
33        1,089
21          441
162      26,244
ΣX² = 28,207

$$\frac{\Sigma X^2}{N} = \frac{28{,}207}{5} = 5{,}641.4$$

$$M_q = \sqrt{5{,}641.4} = 75.1$$

If we represent the quadratic mean by the symbol $M_q$, we can
describe these calculations by the following formula:

$$M_q = \sqrt{\frac{\Sigma X^2}{N}}$$

If we bring together the four averages which we have so far 
computed, all based on the numbers 12, 17, 33, 21, and 162, we 
find the following values: 


$M_q$ = 75.1
$\bar{X}$ = 49.0
$M_g$ = 29.6
$M_h$ = 22.1


It may seem to the student that only one of these can be correct,
and that the other three must be in error. When, moreover, it is
pointed out that one always 1 finds this same general sort of thing
(the quadratic mean largest, followed by the arithmetic mean,
the geometric mean, and the harmonic mean the smallest), the
question arises why one ever computes such peculiar averages.
Suffice it to say at this point that sometimes one of these averages
is “correct” and sometimes another, depending on what the
figures represent and in what way we wish to typify them. Just
as we saw (Sec. 4.3) that one sometimes weights an arithmetic
mean because such a procedure does actually give him the right
answer, so we shall see that sometimes one uses the harmonic, the
geometric, or the quadratic mean because it gives the right
answer. The discussion as to which kind of average to use under
which circumstances is found toward the close of the following
chapter (see Sec. 5.22).

1 Except in the limiting case where all the original values are equal. In
such a case, all four averages will be equal also.

4.9. Quartiles, Deciles, and Percentiles: Ungrouped Data.— 

The median is sometimes called an “average of position”; that 
is, it is defined as the value of an item which holds a certain 
position in the array. It is, we have seen, the item which is so 
located that it divides the array into two parts, there being the 
same number of items in each part. We could, of course, find 
the two points which divide the array into three parts or the seven 
points which divide the array into eight parts. In fact we do 
often wish to find the points which divide the array into 4, 10, or 
100 parts. 

The three points which divide the array into four parts in 
such a way that each part contains the same number of items 
are called the quartiles. Just as we found that the median item 
could be found by counting (N + 1)/2 items from either end, 
so the quartiles can be found by counting (N + 1)/4 items from 
each end. If we revert to the case we used in illustrating the 
median, the examination marks which were listed on page 26 
and arranged in an array on page 67, we discover that there are 
90 marks. Hence the position of the first quartile (the first 
quartile is always the smallest of the quartiles and the third 
quartile the largest, with the second quartile between) will be 
(N + 1)/4 or (90 + 1)/4 or 91/4 or 22.75 items from the bottom. 
If we count up 22 items, we arrive at the value of 88. The 23d 
item is 89. Hence a point 3/4 of the way between them will be 
88.75, and we say that the first quartile (symbolized by $Q_1$) is 
88.75. 

Now let us count down from the top 22 items. This brings us 
to a value of 150. The 23d item has a value of 149. If we locate 
the value which is 3/4 of the way from 150 to 149, we get 149.25. 
Hence we say that the third quartile ($Q_3$) is 149.25. The second 
quartile must obviously be at the center of the array; that is, it is 
identical with the median. Hence we never speak of the second 
quartile, but say that the two quartiles and the median divide the 
array into four parts in such a way that each part contains the 
same number of items. To summarize our results 
for the array of examination marks, we could say 

$Q_1$ = 88.75 

Med. = 112 (see page 68) 

$Q_3$ = 149.25 

These three values are so chosen that they divide the array as 
required. 

We can give the formulas for the positions of the quartiles, 
then, as follows: 

$$Q_1 = \frac{N + 1}{4}$$

$$Q_2 = \frac{2(N + 1)}{4} = \text{Med.}$$

$$Q_3 = \frac{3(N + 1)}{4}$$

This latter value for $Q_3$ shows how far up we would have to count 
from the bottom in order to reach the third quartile. In our 
illustration we counted down (N + 1)/4 items from the top. 
Experiment will show that either method gives the same result. 
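
The counting rule for the quartiles can be sketched in Python as follows. The examination marks themselves are not reproduced in this section, so the sketch uses a small hypothetical array of ten values; the helper names are likewise illustrative assumptions, and fractional positions are handled by the same proportional step between neighboring items that the text uses.

```python
def value_at_position(array, position):
    """Value found `position` items up from the bottom of a sorted array.

    Positions are counted from 1. A fractional position is resolved by
    moving that fraction of the way from one item to the next.
    """
    if not 1 <= position <= len(array):
        raise ValueError("position falls outside the array")
    whole = int(position)            # items counted in full
    fraction = position - whole      # remaining part of a step
    below = array[whole - 1]
    if fraction == 0:
        return below
    above = array[whole]
    return below + fraction * (above - below)

def quartiles(values):
    """Return (Q1, median, Q3) using positions k(N + 1)/4 for k = 1, 2, 3."""
    array = sorted(values)
    n = len(array)
    return tuple(value_at_position(array, k * (n + 1) / 4) for k in (1, 2, 3))

# Hypothetical data, N = 10, so the positions are 2.75, 5.5, and 8.25.
marks = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q1, med, q3 = quartiles(marks)
print(q1, med, q3)

# Counting (N + 1)/4 items down from the top gives the same third quartile.
from_top = value_at_position(sorted(marks, reverse=True), (len(marks) + 1) / 4)
print(from_top)
```

The last two lines illustrate the remark that counting down from the top and counting up from the bottom lead to the same value.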

The deciles are the nine points which so divide the array that 
each part contains the same number of cases as each other part. 
In this case, as the name implies, the array is divided into 10 
groups. The formulas for the positions of the deciles, starting 
with the first (smallest) decile ($D_1$), follow: 

$$D_1 = \frac{N + 1}{10}$$

$$D_2 = \frac{2(N + 1)}{10}$$

$$D_3 = \frac{3(N + 1)}{10}$$

etc. 
If we compute the first two deciles from the data on examination 
marks used before (page 67), we find 

$$D_1 = \frac{N + 1}{10} = \frac{91}{10} = 9.1$$

9th item = 75 
10th item = 76 
1/10 of the way from 9th to 10th = 75.1 

Hence $D_1$ = 75.1 

$$D_2 = \frac{2(N + 1)}{10} = \frac{182}{10} = 18.2$$

18th item = 84 
19th item = 85 
2/10 of the way from 18th to 19th = 84.2 

And the last (9th) decile would be 

$$D_9 = \frac{9(N + 1)}{10} = \frac{819}{10} = 81.9$$

81st item = 164 
82d item = 165 
9/10 of the way from 81st to 82d = 164.9 


Note that in each case here the items have been found to be one 
unit apart. Suppose, in the last illustration, that the 81st item 
had been 164 and the 82d item had been 168. The point 9/10 of 
the way between the two would be at 167.6. This point is discovered 
as follows: The entire distance between 164 and 168 is 4. 
Nine-tenths of 4 is 3.6. Since we are going from the value 164 
toward the value 168, we add the 3.6 to the 164, getting 167.6, 
which would be the ninth decile under such circumstances. 
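
As a hedged illustration of this arithmetic, the sketch below finds the position of the ninth decile for N = 90, splits it into whole items and a fraction, and applies that fraction to both pairs of neighboring values discussed above. The function names are assumptions made for the example; exact fractions are used so that 9/10 is not disturbed by rounding.

```python
from fractions import Fraction

def decile_position(n, k):
    """Position of the k-th decile, counted up from the bottom of the array."""
    return Fraction(k * (n + 1), 10)

def point_between(lower_item, upper_item, fraction):
    """Move `fraction` of the distance from lower_item toward upper_item."""
    return lower_item + fraction * (upper_item - lower_item)

position = decile_position(90, 9)   # 819/10, that is, 81.9
whole = position // 1               # 81 items counted in full
fraction = position % 1             # 9/10 of the way to the 82d item
print(whole, fraction)

# Items one unit apart, as in the examination marks.
print(float(point_between(164, 165, fraction)))
# Items four units apart, as in the supposed case.
print(float(point_between(164, 168, fraction)))
```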

The 99 points which divide the array into 100 parts in such a 
way that the parts contain equal numbers of items are called the 
percentiles. As the student would anticipate from what has gone 
before, the formulas for the position of the percentiles (where 
$P_1$ is the first percentile, $P_2$ the second percentile, etc.) are 

$$P_1 = \frac{N + 1}{100}$$

$$P_2 = \frac{2(N + 1)}{100}$$

$$P_3 = \frac{3(N + 1)}{100}$$

and so on, until we reach 

$$P_{99} = \frac{99(N + 1)}{100}$$


If we take but one example in this case, using the examination 
marks again for purposes of illustration and computing the value 
of the 19th percentile, we find 


$$P_{19} = \frac{19(N + 1)}{100} = \frac{19(91)}{100} = \frac{1729}{100} = 17.29$$

Thus 

17th item is 82 
18th item is 84 
29/100 of the way from 82 to 84 is 82.58 

19th percentile is 82.58 


4.10. Use of Quartiles, Deciles, Etc. —In the preceding chapter 
(see Sec. 3.5) it was pointed out that there are usually advantages 
in using uniform class intervals throughout a frequency table, 
although we saw in Sec. 3.18 that there are cases where it is 
worth while to make exceptions. Now we notice that, while a 
frequency table usually keeps the class interval constant and has 
varying frequencies in the classes, a set of quartiles or deciles or 
percentiles amounts to the same thing as keeping the frequencies 
of the classes constant and varying the class interval. For example, 
if we illustrate again with the 90 examination marks which 
appear in array on page 67, we find the following nine deciles: 

75.1; 84.2; 93.3; 104.4; 112; 126; 

142; 153.8; 164.9 


These points are not equally spaced. The first amounts to an 
open-end class including all cases below 75.1. The next is a 
class running from 75.1 to 84.2, with a class interval of 9.1. 
The next class, running from 84.2 to 93.3, also has a class interval 
of 9.1. The next class runs from 93.3 to 104.4, with a class interval 
of 11.1. The other class intervals are 7.6, 14, 16, 11.8, and 
11.1. Then there is the final class running 164.9 and over. 
While these classes have unequal class intervals (and in a mound-shaped 
distribution the class intervals will ordinarily be smaller 
toward the center of the distribution; the student should make 
sure that he sees why this is true) they contain equal numbers of 
cases. In our illustrative problem, each class contains 9 marks. 
In fact, that is just how we drew them up. We defined the 
percentiles, for example, as the 99 points which divided the 
distribution in 100 parts in such a way that there were equal 
numbers of cases in the various parts. We see, then, that here 
is another case where, in practice, one occasionally wishes to get 
away from the equal class intervals which are so useful when we 
are expecting to carry on further computation. 
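
The class intervals quoted above are simply the differences between successive deciles. A minimal Python sketch, using only the nine decile values given in this section (the variable names are illustrative), recovers them:

```python
# The nine deciles of the 90 examination marks, as listed in the text.
deciles = [75.1, 84.2, 93.3, 104.4, 112, 126, 142, 153.8, 164.9]

# Width of each closed class lying between two neighboring deciles.
# Rounding to one decimal removes binary floating-point residue.
class_intervals = [round(upper - lower, 1)
                   for lower, upper in zip(deciles, deciles[1:])]
print(class_intervals)

# Ten classes (two of them open-ended) share the 90 marks equally.
marks_per_class = 90 // (len(deciles) + 1)
print(marks_per_class)
```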

4.11. Summary of Averages with Ungrouped Data.—If each 
item is stated separately, rather than being grouped with others 
in a frequency table, we compute the various measures of central 
tendency (also called averages, or types) as follows: 

1. The arithmetic mean ($\bar{X}$), also called the arithmetic average or the 
mean: 

Add the numbers given. 

Divide the sum by the number of cases. 

Formula: 

$$\bar{X} = \frac{\Sigma X}{N}$$


2. The weighted arithmetic mean. 

Assign a weight to each number. 

Multiply each number by its weight. 

Add the products just obtained. 

Divide the sum of the products by the sum of the weights. 
Formula: 

$$\bar{X} = \frac{\Sigma(XW)}{\Sigma W}$$


3. The median (Med.): 

Arrange the data in array. 

Count (N + 1)/2 items from either end. 

The value of this item is the median. 

4. The mode (Mo.): 

Count the number of times that each value occurs. 

The value occurring most frequently (if any) is the mode. 

5. The geometric mean ($M_g$): 

Find the logarithms of the values. 

Add these logarithms. 

Divide the sum by the number of cases. 

Take the antilogarithm of the quotient. 

Formula: 

$$\log M_g = \frac{\Sigma(\log X)}{N}$$

6. The harmonic mean ($M_h$): 

Find the reciprocals of the numbers. 

Add the reciprocals. 

Divide the number of cases by the sum of the reciprocals. 

Formula: 

$$M_h = \frac{N}{\Sigma\left(\frac{1}{X}\right)}$$

7. The quadratic mean ($M_q$): 

Find the square of each of the original numbers. 

Add the squares. 

Divide this sum by the number of cases. 

Take the square root of the quotient. 

Formula: 

$$M_q = \sqrt{\frac{\Sigma X^2}{N}}$$

8. The quartiles ($Q_1$, $Q_3$): 

Arrange the data in an array. 

Count (N + 1)/4 items from the lower end. 

The value of the item located here is $Q_1$. 

Count 3(N + 1)/4 items from the lower end. 

The value of this item is $Q_3$. 

$Q_2$ is the median. 

9. The deciles and percentiles ($D_1$, $D_2$, etc.; $P_1$, $P_2$, etc.): 

Arrange the data in an array. 

For the deciles find the values of the items which are multiples of 
(N + 1)/10 from the end. 

For the percentiles find the values of the items which are multiples of 
(N + 1)/100 from the end. 

Having defined our terms, and having seen how these averages 
are computed when each of our original figures is given, we shall 
turn our attention in the following chapter to the methods used 
for finding these averages when the data are grouped together 
in frequency tables. 

EXERCISES 

1. Find the deciles of the data given in Exercise 6 at the end of Chap. III 
(see page 59). 

2. Find the 89th percentile of the data of Exercise 6, page 59. 

3. Company A buys electricity at 3 cents per kilowatt-hour, Company B 
at 2 cents, and Company C at 5 cents. Company A uses 10,000 kw.-hr., 
Company B uses 8,000, and Company C uses 20,000. What was the average 
cost per kilowatt-hour? Use a weighted average. Explain why you use 
the particular weights you do, instead, for example, of using the 
capitalizations of the companies or the numbers of their employees as weights. 

4. Company A pays its employees an average wage of $28 per week. 
Company B pays an average of $35 per week. What figures would you 
need for weights before you could find the average weekly wages of the 
employees of both companies combined? 

5. If you were given the wheat yield (bushels per acre) in each of the 
48 states of the United States, and you wanted to compute the yield (bushels 
per acre) for the United States, why would you have to weight the average 
of the 48 yields, and what figures would you use to weight them with? 

6. Find the quadratic mean, the arithmetic mean, the geometric mean, 
and the harmonic mean of the numbers 40 and 10. 

7. Show in the preceding example that the geometric mean of the two 
numbers is also the geometric mean of their harmonic and arithmetic means. 

8. If you knew the batting average of each member of a baseball team, 
and wanted the team’s batting average, what additional information would 
you need? Under what circumstances would you get the correct answer 
if you took the simple arithmetic average of the figures for the various 
members of the club? 



CHAPTER V 


MEASURES OF CENTRAL TENDENCY (Continued) 

5.1. Averages from Grouped Data. —We discovered in Chap. 
III that the statistician seldom retains his figures in their original 
form, since there are too many of them to be handled easily, 
and since the large number of figures tends to be confusing rather 
than enlightening. In order to compress the data within reasonable 
limits, and to make it possible to get an idea of the general 
nature of the distribution at a glance, he ordinarily classifies 
the data in a frequency table, showing merely the numbers of 
cases which fall in various classes. 

It would at first seem that when we have the data so arranged 
it would be impossible to subject them to further statistical 
manipulation. How can we find the averages of the data? 
How can we compute the value of the arithmetic mean, the 
median, the mode, or any of the other summary figures that we 
studied in the preceding chapter? 

It would evidently be a foolish waste of time for the statistician 
to classify his data in frequency tables if thereafter he could 
carry on no further computations. In this chapter, we shall see 
how it is possible to compute the various averages of figures 
even if they have been grouped or classified in a frequency table. 
We shall then see how and when each average should be used, 
and how it should be interpreted. 

As for the computations themselves, we can understand the 
problem most easily in connection with an illustrative example. 
Castle gives figures (see Table 5.1)¹ showing the heights of 1000 
Harvard students between the ages of 18 and 25, measured at the 
Harvard gymnasium in the years 1914-1916. 

Table 5.1.— Heights of 1000 Harvard Students, Ages 18 to 25 

Height             Number of 
(centimeters)      Students 

155-157                 4 
158-160                 8 
161-163                26 
164-166                53 
167-169                89 
170-172               146 
173-175               188 
176-178               181 
179-181               125 
182-184                92 
185-187                60 
188-190                22 
191-193                 4 
194-196                 1 
197-199                 1 

Total                1000 

¹ W. E. Castle, “Genetics and Eugenics,” Harvard University Press, 
Cambridge, Mass., 1916. By permission of the president and fellows of 
Harvard College. (Data are adapted from data on p. 61 of this book.) 

We have already seen that in such cases we do not know the 
value of a single item. We do not know the exact height of a 
single student out of this group of 1000. To be sure, we can tell 
something about the distribution of heights. We know that 
no student was shorter than 154.5 cm. and that none was taller 
than 199.5 cm. We know that most of them were between 170 
and 180 cm. tall. But whether one student, or 15, or none, had 
a height of 168.3 cm. we cannot tell. How, then, can we tell 
anything about the average height, since we do not know the 
heights of any individuals? How, when our items have 
lost their identity in a frequency table, can we add them or 
multiply them together or arrange them in order of magnitude? 
How can we perform any of the operations which we have needed 
to perform in order that we may compute the various measures 
of central tendency? As a matter of fact, we can do these things 
only if we make certain assumptions. We must now discuss 
these assumptions and see how they enable us to compute 
averages from grouped data. For the statistician computes averages 
from such data quite as often as from ungrouped data, and as a 
matter of fact he usually prefers to do so, because grouped data 
save him time and cause little inaccuracy. 

Since we do not know the exact height of a single one of the 
53 students whose heights are recorded as falling between 164 
and 166 cm., we must assume something about the heights. We 
might assume that the heights were evenly distributed over the 
3-cm. range from 163.5 to 166.5, no two of the men being the same 
height and the differences in their heights being equal. We might 
assume that these 53 students were all of exactly the same height, 
in which case we should be likely to assume that they were all 
located at the middle of the class interval, or at 165 cm. Or 
we might make other assumptions that seemed reasonable. No 
matter what assumption we make we shall be likely to be somewhat 
in error, but we can surely choose a value for these heights 
that will not be in error for any one of the 53 men by more than 
1.5 cm.; that is, if we choose to assume that the students are all 
the same height (165 cm.), the error will not be large in any 
individual case. 

5.2. The Arithmetic Mean: Grouped Data. —When we compute 
the arithmetic mean of data which are grouped in frequency 
tables, we usually assume that the total of the values in any class 
is just what it would be if all the items were located at the mid-point 
of the class. This is the same as assuming that the items in 
the class are evenly distributed throughout the class: either 
assumption would give the same total. In the case that we have 
been using as an illustration, it is assumed that the total height 
of the 53 men in the group whose heights vary from 164 to 166 cm. 
is the same as the total height of 53 men who are each 165 cm. tall. 
In other words, their total height will be 53 × 165 = 8745 cm. 
Of course, if these 53 men were evenly distributed over the range 
from 163.5 to 166.5 cm., no two men being of the same height, the 
average height would still be 165 cm., and the total height would 
still be 8745 cm., as we discovered on the other assumption. 
Hence it makes no difference in this case whether we assume that 
the data are concentrated at the mid-points of classes or are 
evenly distributed throughout the classes. In computing the 
mean we shall make the former assumption, as involving less 
arithmetic. 
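
The equivalence of the two assumptions can be checked numerically. In the sketch below, "evenly distributed" is interpreted as placing each of the 53 men at the center of his own equal slice of the range from 163.5 to 166.5 cm.; that interpretation, and the variable names, are assumptions made for the illustration.

```python
import math

frequency = 53
lower_limit, upper_limit = 163.5, 166.5
class_mark = (lower_limit + upper_limit) / 2

# Assumption 1: all 53 men stand exactly at the class mark.
total_if_concentrated = frequency * class_mark

# Assumption 2: the men are spread evenly, no two alike, each at the
# center of one of 53 equal slices of the 3-cm. range.
slice_width = (upper_limit - lower_limit) / frequency
heights = [lower_limit + (i + 0.5) * slice_width for i in range(frequency)]
total_if_spread = sum(heights)

print(class_mark, total_if_concentrated, round(total_if_spread, 6))
print(math.isclose(total_if_concentrated, total_if_spread))
```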

If we assume that the 1000 students are located at several 
points, these being the class mid-points (or class marks , as they 
are sometimes called), then we can easily determine the average 
height. We say that the four shortest students have each a 
height of 156 cm., or a total height of 624 cm. The next eight 
students, being concentrated at the height of 159 cm., have a 
total height of 1272 cm. If we continue thus throughout the 
table, multiplying in each case the class mark by the number in 
the class, we shall determine the total height of the students in 
each class. If, then, we add these products, we shall get the 
total height of the 1000 students. And we have already discovered 
that the average height is the total height divided by 
the number of cases. Thus we obtain our average easily once it 
is assumed that the heights are concentrated at the class marks. 


Table 5.2.— Computation of Arithmetic Mean from Frequency 
Distribution (Long Method) 

Height           Class Mark    Number of       Total Height 
(centimeters)       (X)        Students (f)        (fX) 

155-157             156              4               624 
158-160             159              8             1,272 
161-163             162             26             4,212 
164-166             165             53             8,745 
167-169             168             89            14,952 
170-172             171            146            24,966 
173-175             174            188            32,712 
176-178             177            181            32,037 
179-181             180            125            22,500 
182-184             183             92            16,836 
185-187             186             60            11,160 
188-190             189             22             4,158 
191-193             192              4               768 
194-196             195              1               195 
197-199             198              1               198 

Totals                            1000           175,335 


Table 5.2 illustrates the process of finding the arithmetic mean 
from frequency data. In the first column the class limits are 
given as they appear in the original. But we assume that all the 
items within any class are located at the class mark, or mid-point, 
which is shown in the second column. We assume that any 
height between 154.5 and 155.5 was recorded as 155. Hence our 
first class presumably includes heights starting at 154.5; likewise, 
at its upper end, it presumably includes heights up to 157.5. 
This is a range of 3 cm., and we would find the class mid-point by 
adding half this range (1.5 cm.) to the lowest limit of the range 
(154.5 cm.). Thus the class mark would be 154.5 + 1.5 = 156 
cm. And since we assume that all four students in the class were 
concentrated at this value, it must be that our X’s for these four 
students (that is, the original values with which we start our 
problem) are 156. Similarly the table shows 8 students with 
heights of 159 cm., 26 students with heights of 162 cm., etc. The 
number of students in each class of the frequency table we indicate 
by the letter f, which always represents the frequency with 
which items occur in a class of a frequency table. It is easy to 
remember that f stands for frequency. 

Finally, in the last column, we have the total height of the 
people in each group. If each of four people measures 156 cm. in 
height, their total height is 4 × 156 = 624 cm. Similarly 
throughout the table we have multiplied each class mark by the 
class frequency (the number of items in the class) to get the total 
height of the people in the class. Hence we label the final column 
fX, since it is found by multiplying the X values by the f values. 

If we summate the last column we find the total height of the 
1000 students to be 175,335 cm. If 1000 students have a total 
height of 175,335 cm., then the average height is 175,335/1000 = 
175.335 cm. But since our original figures are given to the 
nearest even centimeter only, we should not give the average to 
three decimals (even though we should probably not be far in 
error by doing so).¹ Therefore we round off our result to even 
centimeters, making it 175 cm.² 

We can summarize the directions for computing the mean from 
frequency distributions in a formula, as follows: 

$$\bar{X} = \frac{\Sigma(fX)}{\Sigma f} = \frac{\Sigma(fX)}{N}$$

This formula says, “Multiply each X by the corresponding f, and 
add the products. Divide the sum of the products by the sum of 
the frequencies (which is, of course, the total number of cases, 
or N).” 

¹ See Raymond Pearl, “Medical Biometry and Statistics,” 2d ed., 
pp. 362ff., W. B. Saunders Company, Philadelphia, 1930. 

² In Castle, op. cit., from which these data are extracted, the average 
height is given as 174.4 cm. This is contrasted with our 175.3 cm. If one 
were to assume that the class intervals as given mean 155-157.9, 158-160.9, 
etc., the average would, of course, be even higher. I have not discovered 
the cause of the discrepancy. 

If we compare this formula for the arithmetic mean of items 
in a frequency table with our formula for the weighted arithmetic 
mean (see page 63), we discover that they are similar save for 
one substitution. If we write the two formulas side by side this 
will be immediately apparent. 

$$\frac{\Sigma(XW)}{\Sigma W} \qquad\qquad \frac{\Sigma(fX)}{\Sigma f}$$

We see that the formula for use in frequency tables is a duplicate 
of the formula for the weighted arithmetic mean except that we 
have substituted the symbol f for the symbol W. Obviously, 
then, when we find the arithmetic average of numbers classified 
in a frequency table we have really computed a weighted 
arithmetic mean, using the frequencies as weights. 
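
The long method, viewed as a weighted mean with the frequencies as weights, can be sketched in Python as follows. The class marks and frequencies are those of the student-height table; the function name `weighted_mean` is an assumption made for the example.

```python
def weighted_mean(values, weights):
    """Sum of value-times-weight products divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Class marks 156, 159, ..., 198 and the matching class frequencies.
class_marks = list(range(156, 199, 3))
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

total_height = sum(f * x for x, f in zip(class_marks, frequencies))
number_of_cases = sum(frequencies)
mean_height = weighted_mean(class_marks, frequencies)

print(total_height, number_of_cases, mean_height)
print(round(mean_height))   # rounded to whole centimeters, as in the text
```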

5.3. Arithmetic Mean: Short Method.—The method we have 
just used for determining the average of data grouped in a 
frequency table is not the shortest possible method. In fact, it is 
not a method which would be used in practice. We have 
presented it merely to show that it is possible to compute the mean 
from grouped data if we make the proper assumptions. Any 
statistician who wished to compute such a mean would always 
use what is called “the short method.” With this method we 
start by guessing at the mean and then adjust the guess to 
meet the facts. This method is easiest to understand in 
connection with an illustration, and for this we shall use the data 
on student heights which appear in Table 5.1. 

In Table 5.3 the class limits are listed in the first column. 
Since the first class contains those items which vary in size from 
154.5 to 157.5 cm., we take the mid-point of the class as 156 cm. 
and list it in the second column. In the same way the other class 
marks are determined. Then comes the first step that is new. 
We look over the data and guess at the mean, choosing one of 
the class marks as the guessed mean. In this case we chose 174 
as the provisional or guessed mean. We then set down the number 
of steps by which each class differs from the mean, and these 
deviations we list under the heading “class deviations,” and we 
symbolize them by the letter d. Thus the first class (155-157) 
is six classes smaller than the guessed mean, so we label it -6. 
The class labeled “194-196” is seven classes larger than the 
guessed mean, so we label it +7. In this way we state each class 
in terms of the difference from the mean, measuring our 
differences in units of the class interval. 

The fourth column is one with which we are already familiar. 
Here appear the frequencies as before. The last column is the 
product of the class deviations and the corresponding 
frequencies (fd). 


Table 5.3.— Computation of Arithmetic Mean from Frequency 
Distribution (Short Method) 

Height           Class Mark    Class Deviation    Frequency 
(centimeters)       (X)             (d)              (f)         (fd) 

155-157             156             -6                 4          -24 
158-160             159             -5                 8          -40 
161-163             162             -4                26         -104 
164-166             165             -3                53         -159 
167-169             168             -2                89         -178 
170-172             171             -1               146         -146 
173-175             174              0               188            0 
176-178             177             +1               181          181 
179-181             180             +2               125          250 
182-184             183             +3                92          276 
185-187             186             +4                60          240 
188-190             189             +5                22          110 
191-193             192             +6                 4           24 
194-196             195             +7                 1            7 
197-199             198             +8                 1            8 

Totals                                              1000         +445 


Now if we total the last column we find $\Sigma(fd)$ = +445, and 
this divided by $\Sigma f$ (or N) gives 445/1000 = 0.445. This tells us 
that the true average of these data is 0.445 class intervals above 
the guessed mean. (Had the sign of $\Sigma(fd)$ been minus, the true 
average would have been below the guessed average.) Now the 
class interval is 3 cm., and 0.445 class intervals make 1.335 cm. 
The guessed average was 174 cm., and, if we add thereto the 
correction of 1.335 cm. which we have just found, we get 175.335 cm. 
as the average. This is exactly the same as the average which we 
found when we carried on the computations by the long method, 
as will be seen by reference to page 86. 
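
The short-method computation just described may be sketched in Python. The function name and its arguments are illustrative assumptions; the sketch also assumes, as in the table, that the guessed mean is itself a class mark, so that every class deviation is a whole number of class intervals.

```python
def short_method_mean(class_marks, frequencies, guessed_mean, class_interval):
    """Return (mean, sum of fd) by the short method.

    Assumes every class mark differs from the guessed mean by a whole
    number of class intervals.
    """
    deviations = [(x - guessed_mean) // class_interval for x in class_marks]
    sum_fd = sum(f * d for f, d in zip(frequencies, deviations))
    n = sum(frequencies)
    correction = class_interval * sum_fd / n   # in original units (cm.)
    return guessed_mean + correction, sum_fd

class_marks = list(range(156, 199, 3))
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

mean, sum_fd = short_method_mean(class_marks, frequencies,
                                 guessed_mean=174, class_interval=3)
print(sum_fd, round(mean, 3))
```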

What we have actually done by this short method is to assume 
a mean and to find on the average how far the items fall from this 
assumed mean. Our unit of measure is the class interval, which 
in this case is 3 cm. We know that if the mean is correctly 
chosen the sum of the positive deviations will exactly offset the 
sum of the negative deviations, so that when, as in this case, the 
positive deviations are larger than the negative, the assumed 
mean is too small and must be raised enough so that positive and 
negative deviations will balance. If the negative deviations 
exceed the positive we must, on the other hand, lower the average. 
Our process consists in finding out by how many units (class 
intervals) we must adjust the assumed mean to make it coincide 
with the true mean.¹ 

Any point may be chosen as the assumed mean, although the 
work involved is much less if the assumed mean is at the mid-point 
of one of the classes. Also it helps somewhat if the class 
chosen is near the middle of the distribution, so that the numbers 
used are as small as possible. 
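
That any point may serve as the assumed mean can be illustrated with a short Python sketch. Several guesses are tried, including one (170.25) that is not a class mark, so the deviations are allowed to be fractional here; the function name and the particular guesses are assumptions made for the example.

```python
import math

CLASS_INTERVAL = 3
class_marks = list(range(156, 199, 3))
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

def mean_from_guess(guess):
    """Adjust a guessed mean by the average deviation, in class-interval units."""
    n = sum(frequencies)
    sum_fd = sum(f * (x - guess) / CLASS_INTERVAL
                 for x, f in zip(class_marks, frequencies))
    return guess + CLASS_INTERVAL * sum_fd / n, sum_fd

for guess in (156, 174, 177, 170.25):
    mean, sum_fd = mean_from_guess(guess)
    print(guess, round(sum_fd, 3), round(mean, 3))
```

A guess near the middle, such as 174 or 177, keeps the sum of the fd column small, while a guess of 156 gives the same mean at the cost of much larger numbers. With 177 as the guess the sum is negative, and the guess is lowered rather than raised.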

The so-called “short process” of computing the mean seems 
like a long process when described in such detail. If the student 
will compute a mean from a frequency distribution, using first the 
long and then the short method, and timing the process, he will 
discover that the short method is correctly named. We may 
summarize the steps of the short method as follows: 

1. List the class marks. 

2. Locate a guessed mean near the middle of the distribution and at a class 
mark. 

3. State the other classes in terms of class deviations from the guessed 
mean. The deviations are plus and minus, and the signs are important. 
Values lower than the guessed mean are minus; others are plus. The class 
containing the guessed mean is marked “0.” 

4. List the frequencies of the classes. 

5. Multiply each frequency by its class deviation, keeping the plus and 
minus signs. 

6. Add the products obtained in the preceding step. 

7. Divide the sum just obtained by the sum of the frequencies (that is, 
by N) and multiply the quotient thus obtained by the class interval. 

8. Add the result obtained in the preceding step to the guessed mean 
of step 2. 

¹ That the algebraic sum of the deviations from the arithmetic mean must 
equal zero is shown by the following: 

Each deviation from the mean may be defined as follows: 

$$x = X - \bar{X}$$

The sum of the deviations is, then, 

$$\Sigma x = \Sigma(X - \bar{X}) = \Sigma X - N(\bar{X})$$

But since $\bar{X} = \frac{\Sigma X}{N}$, 

$$\Sigma X = N(\bar{X})$$

It is therefore obvious that $\Sigma X - N(\bar{X}) = 0$ and that $\Sigma x = 0$. 

If we are to boil these directions down into a convenient 
formula which gives directions for computing the mean from 
grouped data, we shall need some new symbols. Let d represent 
the distance (measured in units of the class interval) from the 
assumed mean. Let $X'$ represent the assumed mean (as distinct 
from the real mean, which is represented by $\bar{X}$). Let $C_i$ represent 
the class interval, which in the illustration was 3 cm. Then our 
formula is as follows:¹ 

$$\bar{X} = X' + C_i\left(\frac{\Sigma(fd)}{\Sigma f}\right) = X' + C_i\left(\frac{\Sigma(fd)}{N}\right)$$

In our case this becomes 

$$\bar{X} = 174 + 3\left(\frac{445}{1000}\right) = 175.335$$

As before, we should round off the answer to 175 cm. to correspond 
with the accuracy of the original figures. 

¹ On p. 86 we defined the mean of a frequency distribution thus: 

$$\bar{X} = \frac{\Sigma(fX)}{N}$$

But each actual value of X is equal to the guessed mean plus (or minus) an 
amount equal to the number of class deviations times the class interval. 
That is, 

$$X = X' + (C_i)(d)$$

Substituting this in the formula above gives 

$$\bar{X} = \frac{\Sigma\{f[X' + (C_i)(d)]\}}{N} = \frac{\Sigma[f(C_i)(d)]}{N} + \frac{\Sigma(fX')}{N} = C_i\,\frac{\Sigma(fd)}{N} + \frac{X'(\Sigma f)}{N}$$

Since, however, $\Sigma f = N$, this becomes 

$$\bar{X} = X' + C_i\,\frac{\Sigma(fd)}{N}$$

5.4. Checking Accuracy of Computations. —Whenever a good 
statistician gets an answer to any problem, he immediately checks it to see if it is 
reasonable. Suppose, for example, that we had found in our last example 
an average height of 487 cm. We know by looking at our original table 
that this is out of the question, since no student had a height greater than 
199.5 cm. Yet it amazes teachers of statistics year after year to have 
students turn in answers on examinations which are as obviously wrong as 
this one. It is a good plan to study your data before you start your 
computations, estimating roughly what answer should be expected. Then if the 
computations give an answer which differs widely from that expected, one 
should question the accuracy of his work. 
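
The simplest form of this reasonableness test is that an average cannot lie outside the range of the data. A minimal Python sketch, with an illustrative function name and the class limits of the student-height table, follows.

```python
def is_plausible_mean(mean, lowest_limit, highest_limit):
    """An average of the data cannot fall outside the extreme class limits."""
    return lowest_limit <= mean <= highest_limit

# No student was shorter than 154.5 cm. or taller than 199.5 cm.
print(is_plausible_mean(175.335, 154.5, 199.5))
print(is_plausible_mean(487, 154.5, 199.5))
```

A result that fails this test is certainly wrong; a result that passes it may still be wrong, which is why a closer check such as the one described next is worth the trouble.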


Table 5.4.— Computation of Arithmetic Mean from Frequency 
Distribution (Short Method) with Charlier Check 

Height           Class Mark    Class Deviation    Frequency 
(centimeters)       (X)             (d)              (f)         (fd)      f(d + 1) 

155-157             156             -6                 4          -24         -20 
158-160             159             -5                 8          -40         -32 
161-163             162             -4                26         -104         -78 
164-166             165             -3                53         -159        -106 
167-169             168             -2                89         -178         -89 
170-172             171             -1               146         -146           0 
173-175             174              0               188            0         188 
176-178             177             +1               181          181         362 
179-181             180             +2               125          250         375 
182-184             183             +3                92          276         368 
185-187             186             +4                60          240         300 
188-190             189             +5                22          110         132 
191-193             192             +6                 4           24          28 
194-196             195             +7                 1            7           8 
197-199             198             +8                 1            8           9 

Totals                                              1000         +445       +1445 


Fortunately with some statistical computations it is possible to check 
the accuracy of the work as one proceeds, so that at the end of any step 
in the process he can tell whether or not errors have been made. When such 
checks are possible, it is wise for the student to get in the habit of using 
them, since they take little time and effort at worst, and at best may save 
hours of rechecking and recalculating. In the case of the arithmetic mean 
computed from a frequency table, it is possible to apply what is known as the 
“Charlier check” to prove the accuracy (or demonstrate the inaccuracy) 
of our arithmetic.

To compute the Charlier check for the arithmetic mean, we merely add
one column to our table by showing the values of f(d + 1). This gives us
Table 5.4, which the student should compare with Table 5.3. It will be
seen at once how the figures in the last column are derived.

The first figure in the new last column is -20. This is found by adding
1 to the value of d (which gives us -6 + 1 = -5) and multiplying by the
value of f (which is 4). Each figure in the last column is found similarly,
by adding 1 to the value of d, and multiplying by the corresponding value
of f. To take one more case, the fourth from the last item in the column is
the number 132. It was found by adding 1 to the value of d (which was 5)
to get the value of d + 1, or 6. This value was multiplied by the
corresponding value of f (22) to get the value 132 in the last column.

When this check is applied we find that, if our arithmetic has been correct,
the sum of the last column, $\sum[f(d + 1)]$, will always be equal to the sum of
the totals of the two preceding columns. In the case of Table 5.4, we note
that the last column yields a total of 1445, which is the sum of the two
preceding totals, 1000 and 445. Sometimes the sums of one or two of the
columns are negative, and the student must be careful to keep track of
signs. For example, it might be that the sum of the frequencies would be
865, the sum of the values of fd might be -148, and in that case the sum of
the values of f(d + 1) should be 865 + (-148) = 717. But if the total
of the last column is not equal to the algebraic sum of the other two totals,
some mistake in arithmetic has been made. 1
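The check is easy to mechanize. The following Python sketch is only an illustration: it rebuilds the (fd) and f(d + 1) columns from the deviations and frequencies of Table 5.4 and then compares the three column totals. The names `fd_column`, `fd_plus_1_column`, and `charlier_check` are invented for the example. When a machine forms all three sums the identity holds automatically, so the function takes the totals as arguments, which is how the check would be applied to totals reached by hand.

```python
# Class deviations (d) and frequencies (f) copied from Table 5.4.
deviations = list(range(-6, 9))  # -6, -5, ..., +8
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

# Build the (fd) and f(d + 1) columns row by row.
fd_column = [f * d for f, d in zip(frequencies, deviations)]
fd_plus_1_column = [f * (d + 1) for f, d in zip(frequencies, deviations)]


def charlier_check(total_f, total_fd, total_fd_plus_1):
    """True when sum f(d + 1) equals sum f + sum fd (signs respected)."""
    return total_fd_plus_1 == total_f + total_fd


totals = (sum(frequencies), sum(fd_column), sum(fd_plus_1_column))
print(totals)                             # (1000, 445, 1445)
print(charlier_check(*totals))            # True
print(charlier_check(865, -148, 717))     # True: the signed example in the text
print(charlier_check(865, -148, 1013))    # False: a sign was dropped somewhere
```

The last call shows the kind of slip the check is meant to catch: adding 148 instead of subtracting it produces a total that fails the comparison.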


5.5. Grouping Error with the Arithmetic Mean.—In computing
the arithmetic mean from a frequency table, we have assumed 
that all the items in any particular class are concentrated at the 
mid-point of the class. Of course, our results would be the 
same if no two of the items in the group were the same size, but 
if they were arranged at equal intervals throughout the class 
from the lower to the upper class limit. Likewise our assumption 
would bring no inaccuracy no matter how irregularly the items 
were scattered in the class if the average of the items within each 
class were equal to the class mark. Therefore we can say that 
any one of three assumptions would give us the same results, 
namely: 

1. All items within a class are the same size, each equal to the class mark. 

2. No two items in the class are the same size, and the values are spaced 
equidistantly throughout the class interval. 

3. The average of the items within each class is equal to the class mark. 

1 The student who is mathematically inclined will prefer to have this
statement proved.

$$f(d + 1) = fd + f$$
$$\sum[f(d + 1)] = \sum fd + \sum f$$

In practice, probably none of these assumptions is exactly
true. Yet if in one class the items run a little larger than the
class mark we expect by the laws of chance that in some other 
class they will tend to run a little lower than the class mark, and 
if we have a large enough number of items and a large enough 
number of classes we should expect any errors to cancel each other 
out. We find empirically that the arithmetic mean computed 
by the methods just described is reasonably accurate. 

The student can easily test the accuracy of the method by 
trying it in cases where the actual average is known. For 
example, on page 26 appear 90 marks received by students on 
an examination. The arithmetic mean of these 90 marks is 
10,635/90 = 118.17. If we group these data in a frequency 
table with a class interval of 5 and the lower class limits at 40, 
45, 50, etc., we can compute the average from the frequency 
table, in which case we get an answer of 118.44. Table 3.2 shows 
the data arranged in a frequency table with a class interval of 10. 
Here the average is again 118.44. If we make up a frequency 
table with a class interval of 25, with the lower limits at 25, 50, 
75, etc., we find an average of 120.28. Table 3.3 shows these 
data classified with a class interval of 50. Here the arithmetic 
mean is 121.11. The student will note that our results from the 
frequency table fall very close to the actual average of the 90 
marks as long as we have a reasonably large number of classes. 
It can be said in general that the grouping error, which arises 
when all the cases in a class are treated alike, increases as the 
class interval increases and is less for continuous than for discrete 
data. 
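The experiment described above can be repeated on any list of marks. The Python sketch below is an illustration under stated assumptions: the 90 marks are made up (they are not the data of page 26), the marks are whole numbers, and each class mark is taken as the mid-point of the integer class limits, as in the tables of this chapter. The function name `grouped_mean` and its parameters are hypothetical.

```python
from collections import Counter


def grouped_mean(values, width, start=0):
    """Mean computed from a frequency table with the given class interval.

    Classes run start..start+width-1, start+width..start+2*width-1, etc.,
    and every item is treated as if it stood at its class mark.
    """
    counts = Counter((x - start) // width for x in values)
    total = sum(f * (start + k * width + (width - 1) / 2)
                for k, f in counts.items())
    return total / len(values)


# Made-up marks for illustration only.
marks = [40 + (37 * i) % 160 for i in range(90)]
true_mean = sum(marks) / len(marks)

for width in (5, 10, 25, 50):
    approx = grouped_mean(marks, width)
    print(width, round(approx, 2), round(approx - true_mean, 2))
```

Because no item can lie farther than half a class interval from its class mark, the grouping error of the mean can never exceed that amount, which is one way of seeing why the error tends to grow as the class interval is widened.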

5.6. The Median: Grouped Data.—If we study the data on
heights of students with the intention of determining the median,
we find the method very similar to that used when the data were
not grouped. The problem is still that of discovering a value
such that it will divide the distribution into two groups containing
the same numbers of items. As before we start by determining
how far from the lower end of the distribution we shall have to
go to reach such a value. When the data were not grouped, the
median value was number (N + 1)/2 from either end. When the
data are grouped, the problem differs slightly, and the median
item is number N/2 from either end. Unless we make this
change, we shall get one answer when we start from one end and
another answer when we start from the other end. Since we
have 1000 items, we must find the N/2 item or the 1000/2 item or
the 500th item; that is, we wish to know the size of the 500th
item. If the 1000 students are arranged in an array, we wish
to know the height of the student who is number 500 from either
end.

Our original distribution of heights was as indicated in Table
5.5.

Table 5.5.—Heights of 1000 Harvard Students, Ages 18 to 25

Height           Number of
(centimeters)    Students

155-157               4
158-160               8
161-163              26
164-166              53
167-169              89
170-172             146
173-175             188
176-178             181
179-181             125
182-184              92
185-187              60
188-190              22
191-193               4
194-196               1
197-199               1

Let us start with the shortest men and count until we have
reached the 500th man. The lowest class contains 4 men; the
two lower classes contain 12 men when taken together; the three 
lower classes together contain 38 men. If we continue thus, we 
find that the six lowest classes contain 326 students and the 
seven lowest classes contain 514 men. If we want the man who is 
500th from the bottom, he must be farther along than the top of 
the sixth class, but not so far along as the top of the seventh 
class; that is, he must be located somewhere in the seventh class. 
If we assumed, as before, that all items in the seventh class were 
at the class mark, it would be an easy matter to locate the median. 
But now we vary our assumption, assuming that the items are 
evenly distributed throughout the range of the class; in other 
words, that the 188 items in the seventh class are considered
as being equidistant and as being spread over the entire distance
from 172.5 (the actual lower limit of the class) to 175.5 cm. (the
actual upper limit of the class). If we follow this assumption
it is relatively easy for us to determine the location of the median.
The six lower classes contain 326 students, and we wish to locate
the 500th student; that is, we have to go 174 items into the
seventh class, or 174/188ths of the way from the lower limit
toward the upper limit of the class. Since the class interval is
3 cm., this means that we have to go up from the actual lower
limit of the class a distance of (174/188)(3 cm.) or 2.78 cm. The
actual lower limit is 172.5 cm.; so we have the equation

$$\text{Med.} = 172.5 + 2.78 = 175.28 \text{ cm.}$$

Note that this carries us almost to the actual upper limit of the
class (175.5 cm.). This is to be expected, since we went up from
the bottom of the class through 174 of the 188 items in the class.
We can compare the median of 175 (which we get by rounding off
the computed value of 175.28) with the arithmetic mean, which
we have already discovered to be 175 (see page 88). Before the
two values were rounded off they were

$$\bar{X} = 175.335$$
$$\text{Med.} = 175.28$$

It must not be thought, however, that it is necessary for the 
median and the mean to coincide or to be approximately equal. 
We saw on page 68 a case where they were decidedly different. 
That the median and the mean are, in this case, so nearly 
identical is due to the fact that the heights of the students were 
so symmetrically distributed. This can be seen roughly from the 
chart on page 96, which shows the heights of the students; but 
in the main the question of symmetry of curves will be postponed 
until later. 

Let us now summarize the methods used in finding the median 
from data grouped in frequency tables. The steps are these: 

1. Compute N/2 to discover the location of the item desired. 1

2. Add the class frequencies until the class containing the median item is
discovered.

3. Find how many items one must count into this class to reach the
median item.

4. Find what fraction this is of the total number of items in the class.

5. Add this fraction of the class interval to the actual lower limit of the
class.

1 The student should apply the method just illustrated for finding the
median to the same distribution of heights, counting from the top instead
of from the bottom. He will discover that the use of N/2 will give the same
answer regardless of the end from which he starts, but the use of (N + 1)/2
will not. This is the reason for using N/2 here instead of the value used
when the data are not grouped.
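The five steps for the grouped median translate directly into a short routine. The Python sketch below is an illustration, not a definitive implementation: it assumes equal class intervals and the frequencies of Table 5.5, and the names `grouped_median`, `FREQUENCIES`, and the parameter names are invented for the example. The last lines repeat the count from the top of the distribution to show that N/2 leads to the same value from either end.

```python
FREQUENCIES = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def grouped_median(frequencies, lower_limit, class_interval):
    """Median of a frequency table with equal class intervals.

    lower_limit is the actual lower limit of the first class.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty frequency table")
    target = n / 2                                   # step 1
    cumulative = 0
    for k, f in enumerate(frequencies):              # step 2
        if f > 0 and cumulative + f >= target:
            into_class = target - cumulative         # step 3
            fraction = into_class / f                # step 4
            class_lower = lower_limit + k * class_interval
            return class_lower + fraction * class_interval   # step 5
        cumulative += f
    raise ValueError("median class not found")


median = grouped_median(FREQUENCIES, 154.5, 3)
print(round(median, 2))   # 175.28

# Counting from the top: measure the distance down from 199.5 cm.
distance_down = grouped_median(FREQUENCIES[::-1], 0.0, 3)
median_from_top = 199.5 - distance_down
print(round(median_from_top, 2))   # 175.28
```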

When we are dealing with frequency curves, as in this chapter,
we can well redefine the median as that particular value on the
base scale of a frequency polygon from which a perpendicular will
divide the area under the curve exactly in half. In Fig. 5.1 if
we erect a perpendicular from the point on the base scale which
corresponds to 175.28 (our value of the median) we shall divide
the area under the curve in two, and the two new areas will be
equal or approximately so.

Fig. 5.1. Number of Harvard students between the ages of 18 and 25 years
with various heights, 1914-1916. (Horizontal scale: height, centimeters.)

We should also note that the definition of the median breaks
down in the case of some discrete data. For example, Ernest
Thompson Seton counted the number of eggs in each of 77 pelicans'
nests, and found that 4 nests contained 1 egg each, 65 nests
contained 2 eggs each, 5 nests contained 3 eggs each, and 3 nests
contained 4 eggs each. In this case there is no median which
meets our definition exactly; that is, there is no number of eggs
which is exceeded by as many cases as those which fall short of
it. If we pick 2 as the median, we find it exceeded by 8 cases
with but 4 cases smaller. If we choose 3 as the median, we find
69 cases smaller and but 3 larger. In such a case the whole idea
of the median may break down.
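The pelican example can be tallied mechanically. The Python sketch below is an illustration using Seton's counts as quoted in the text; the names `eggs_per_nest` and `cases_below_and_above` are invented for the example. For each possible number of eggs it counts the nests falling short of it and the nests exceeding it.

```python
# Number of eggs -> number of nests, as quoted from Seton's count of 77 nests.
eggs_per_nest = {1: 4, 2: 65, 3: 5, 4: 3}


def cases_below_and_above(counts, candidate):
    """Return (cases smaller than candidate, cases larger than candidate)."""
    below = sum(f for value, f in counts.items() if value < candidate)
    above = sum(f for value, f in counts.items() if value > candidate)
    return below, above


for candidate in sorted(eggs_per_nest):
    print(candidate, cases_below_and_above(eggs_per_nest, candidate))
# 1 (0, 73)
# 2 (4, 8)
# 3 (69, 3)
# 4 (74, 0)
```

No candidate gives equal counts on the two sides, which is the sense in which the definition fails for these data.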



Fig. 5.2. Determination of the median from an ogive. 


5.7. Finding the Median from an Ogive.—We saw in Sec. 3.11
that every frequency table can be converted into the form of an
ogive. This ogive form is often useful for finding the approximate
values of the “averages of position,” such as the median
or the quartiles. In Fig. 5.2 the data of Table 5.5 have been
put in such form. We notice that the curve takes the peculiar
S-shape assumed by mound-shaped frequency curves when they
are converted to ogives (see Sec. 3.14). Since there are 1000
cases in this distribution, the median will be the value of the 500th
case. If we find the value 500 on the vertical scale, find the point
on the ogive curve horizontally opposite it, and then drop a
perpendicular from this point on the ogive to the base scale, we can
read the approximate value of the median on the base scale.
This has been indicated on the diagram by two dotted lines, and
we see that the value of the median is approximately 175.
Similarly we could read approximate values for quartiles, deciles, etc.
With 1000 cases the third decile would correspond to the 300th
case, and we read from Fig. 5.2 a value of approximately 172.
The value of the 61st percentile (the 610th case) seems to be
approximately 177. Naturally the larger we make our scales,
the more accurately we can read our values from the ogive.

5.8. The Mode: Grouped Data. —The mode is the value which 
occurs most frequently, and it is usually easier to locate when the 
data are grouped in frequency tables than otherwise. If we look 
again at Table 5.1, page 83, which shows the distribution of 
students’ heights, we see at once that the most common height is 
somewhere in the neighborhood of 173 to 178 cm. Sometimes 
people call the class mark of the most populous class the mode. 
It is better to give this measure the name crude mode to distinguish
it from the more accurate computed mode.

The mode is the least satisfactory of the measures of central 
tendency to compute. Some distributions show no mode at 
all, and other distributions show two or more. Even in those 
distributions which seem to show one marked spot of concentration,
such as the distribution of student heights pictured in Fig.
5.1, we run into the difficulty that different students studying
the same data will find very different modes depending on how 
the data are grouped. One could classify the data on the student 
heights, for example, into 10 different frequency tables, varying 
the class interval somewhat and shifting the positions of the class 
limits even when the class interval remained the same, and the 
arithmetic average computed from each of these 10 frequency 
tables would be approximately the same as long as a reasonable 
number of classes was used. But these 10 different frequency 
tables, all based on exactly the same original data, would yield 
significantly different values for the mode were we to compute 
the mode by any of the commoner and simpler methods usually 
outlined in textbooks on elementary statistics. For this reason 
it seems unwise to stress here any of these makeshift and unreliable
methods. The best methods will be explained later in
Chap. VIII, and here we limit ourselves to suggesting that the 
student who wants a rough approximation of the mode may use 
any one of three crude methods:

1. Plot the frequency polygon. Superimpose a smooth freehand frequency
curve (see Sec. 3.10). The mode is the value on the horizontal axis
directly under the highest point on this frequency curve. Thus in Fig. 3.3
on page 39 the mode appears to be approximately 176 cm.

2. For frequency distributions which are mound-shaped and symmetrical
or only moderately skewed, use the relationship

$$\text{Mo.} = 3\,\text{Med.} - 2\bar{X}$$

This formula indicates that in such distributions the median lies between
the arithmetic mean and the mode, and about twice as far from the mode
as from the arithmetic mean. We have already found that the arithmetic
mean of student heights was 175.335 and the median was 175.28 (see pages
88 and 95). Substituting these values in our formula we find

$$\text{Mo.} = 3(175.28) - 2(175.335) = 175.17$$

The method is so rough that we should certainly round off the result to
175 cm.

3. Let m represent the smallest class mark in the frequency table, let
g represent the number of groups or classes in the table, and let $C_i$ represent
the class interval. Then approximately

$$\text{Mo.} = m + \frac{2g(\bar{X} - m) - C_i(g - 1)}{2(g - 1)}$$

In our problem of student heights (see Table 5.3) m is 156 cm., g is 15,
$C_i$ is 3 cm., and the arithmetic mean is 175.33 cm. Substituting these values
in our formula gives a value of 175.21 cm. for the mode.
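The second and third crude methods are simple enough to check by machine. The Python sketch below is an illustration only. It assumes the third formula in the form spelled out step by step in the summary of this chapter, Mo. = m + [2g(mean - m) - Ci(g - 1)] / [2(g - 1)]; the function names and parameter names are invented for the example.

```python
def mode_from_mean_and_median(mean, median):
    """Second crude method: Mo. = 3 Med. - 2 (arithmetic mean)."""
    return 3 * median - 2 * mean


def mode_from_class_marks(mean, smallest_mark, n_classes, class_interval):
    """Third crude method, in the form assumed in the lead-in above."""
    g = n_classes
    numerator = 2 * g * (mean - smallest_mark) - class_interval * (g - 1)
    return smallest_mark + numerator / (2 * (g - 1))


# Figures for the student heights quoted in the text.
print(round(mode_from_mean_and_median(175.335, 175.28), 2))   # 175.17
print(round(mode_from_class_marks(175.33, 156, 15, 3), 2))    # 175.21
```

Both results round to 175 cm., which is as much precision as methods this rough deserve.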

5.9. The Geometric Mean: Grouped Data.—We saw in the
preceding chapter that the geometric mean is usually computed
in practice with the aid of logarithms, and a quick review of the
work taken up in Sec. 4.0 will show that the logarithm of the
geometric mean is the arithmetic average of the logarithms of
the original figures. Since the computation of the geometric
mean with the aid of logarithms involves the computation of an
arithmetic mean, and since we have already learned in Sec. 5.2
how to compute an arithmetic mean of numbers grouped in a
frequency table, our present problem really involves little or
nothing in the way of new material. Instead of using the class
marks, we shall use the logarithms of these class marks; but since,
even if the class marks are themselves evenly spaced, their logarithms
will not be evenly spaced, we must use the long method
of Sec. 5.2 rather than the short method of Sec. 5.3.

Table 5.6 gives the data on student heights, with the class
marks from Table 5.2 in the first column and the frequencies
from the same table in the second column. In the third column
appear five-place common logarithms of the class marks. These,
of course, are log X. Our formula becomes

$$\log M_g = \frac{\sum f(\log X)}{N}$$

Therefore we multiply each item in column 2 by the corresponding
logarithm in column 3, to get the products which appear
in column 4. The sum of column 4 is $\sum f(\log X)$, amounting in
this particular problem to 2243.56097. We divide this by 1000,
the number of cases, to get, as the logarithm of the geometric
mean, 2.24356. This corresponds to a geometric mean of 175.2
cm.


Table 5.6.—Computation of the Geometric Mean from a Frequency
Table

Class Marks     Number of
(centimeters)   Students
    (X)            (f)         log X       f(log X)

    156              4        2.19312        8.77248
    159              8        2.20140       17.61120
    162             26        2.20952       57.44752
    165             53        2.21748      117.52644
    168             89        2.22531      198.05259
    171            146        2.23300      326.01800
    174            188        2.24055      421.22340
    177            181        2.24797      406.88257
    180            125        2.25527      281.90875
    183             92        2.26245      208.14540
    186             60        2.26951      136.17060
    189             22        2.27646       50.08212
    192              4        2.28330        9.13320
    195              1        2.29003        2.29003
    198              1        2.29667        2.29667

  Totals          1000                    2243.56097

The geometric mean is, as usual, slightly smaller than the 
arithmetic mean, but the student will notice that in this case the 
difference is almost negligible. 
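The computation of Table 5.6 can be reproduced without a table of logarithms. The Python sketch below is an illustration that assumes the class marks and frequencies shown in the table; it uses full-precision common logarithms from the standard `math` module rather than five-place values, so the last digits of the intermediate sum may differ slightly from the printed column. The function name `grouped_geometric_mean` is invented for the example.

```python
import math

CLASS_MARKS = list(range(156, 199, 3))   # 156, 159, ..., 198
FREQUENCIES = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def grouped_geometric_mean(marks, freqs):
    """Antilog of the frequency-weighted mean of log10(class mark)."""
    n = sum(freqs)
    log_mean = sum(f * math.log10(x) for x, f in zip(marks, freqs)) / n
    return 10 ** log_mean


geometric = grouped_geometric_mean(CLASS_MARKS, FREQUENCIES)
arithmetic = sum(f * x for x, f in zip(CLASS_MARKS, FREQUENCIES)) / sum(FREQUENCIES)
print(round(geometric, 1), round(arithmetic, 3))   # 175.2 175.335
```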

5.10. The Harmonic Mean: Grouped Data.—Just as the geometric
mean is based on an arithmetic mean of logarithms, so
is the harmonic mean based on an arithmetic mean of reciprocals.
The formula for this average when found from a frequency table
can best be written thus:

$$M_h = \frac{N}{\sum \left(\dfrac{f}{X}\right)}$$

Illustrating again with the student heights, we adapt the “long”
method for the arithmetic mean. Table 5.7 illustrates the procedure.

Table 5.7.—Computation of the Harmonic Mean from a Frequency
Table

Class Marks     Number of
(centimeters)   Students
    (X)            (f)         (f/X)

    156              4        0.02564
    159              8        0.05031
    162             26        0.16049
    165             53        0.32121
    168             89        0.52976
    171            146        0.85380
    174            188        1.08046
    177            181        1.02260
    180            125        0.69444
    183             92        0.50273
    186             60        0.32258
    189             22        0.11640
    192              4        0.02083
    195              1        0.00513
    198              1        0.00506

  Totals          1000        5.71144

The first column shows the class marks from Table
5.2, while the frequencies appear again in the second column.
The figures in the third column are found by dividing each frequency
by the corresponding class mark. The total number of
cases (here 1000) is then divided by the sum of this third column
(here 5.71144) to get the harmonic mean, which turns out to be
175.09 cm. We notice here, as we have come to expect, that
the value of the harmonic mean is smaller than that of either the
arithmetic or geometric means, although the values of all three
fall very close together in this problem.
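The same table can be worked in a few lines of code. The Python sketch below is an illustration that assumes the class marks and frequencies of Table 5.7; the function name `grouped_harmonic_mean` is invented for the example. Because the quotients are carried at full precision rather than five decimals, the sum of the f/X column may differ from the printed total in the last place.

```python
CLASS_MARKS = list(range(156, 199, 3))   # 156, 159, ..., 198
FREQUENCIES = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def grouped_harmonic_mean(marks, freqs):
    """Number of cases divided by the sum of f / X."""
    return sum(freqs) / sum(f / x for x, f in zip(marks, freqs))


harmonic = grouped_harmonic_mean(CLASS_MARKS, FREQUENCIES)
arithmetic = sum(f * x for x, f in zip(CLASS_MARKS, FREQUENCIES)) / sum(FREQUENCIES)
print(round(harmonic, 2))       # 175.09
print(harmonic < arithmetic)    # True
```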

5.11. The Quadratic Mean: Grouped Data.—Just as the geometric
mean of a series of numbers is based on the arithmetic
mean of their logarithms, and as their harmonic mean is based on
the arithmetic mean of their reciprocals, so their quadratic mean
is based on the arithmetic mean of their squares. We can give
the formula for the quadratic mean of numbers grouped in a
frequency table as follows:

$$M_q = \sqrt{\frac{\sum f(X^2)}{N}}$$
For our problem of student heights this gives us Table 5.8.

Table 5.8.—Computation of the Quadratic Mean from a Frequency
Table

Class Marks     Number of
(centimeters)   Students
    (X)            (f)          X²           fX²

    156              4        24,336         97,344
    159              8        25,281        202,248
    162             26        26,244        682,344
    165             53        27,225      1,442,925
    168             89        28,224      2,511,936
    171            146        29,241      4,269,186
    174            188        30,276      5,691,888
    177            181        31,329      5,670,549
    180            125        32,400      4,050,000
    183             92        33,489      3,080,988
    186             60        34,596      2,075,760
    189             22        35,721        785,862
    192              4        36,864        147,456
    195              1        38,025         38,025
    198              1        39,204         39,204

  Totals          1000                   30,785,715

Here the first column contains the class marks from Table 5.2. The
second column contains the corresponding frequencies. The
third column contains the squares of the corresponding numbers
in the first column. And in the last column are the products
found by multiplying each figure in the second column by the
corresponding figure in the third column. The sums of the
second and the fourth columns give us N and $\sum f(X^2)$, respectively,
for use in our formula.

In this problem we need to find the square root of 30,785,715/1000,
or the square root of 30785.715. This gives 175.46 cm., which is,
as it should be, slightly larger than the arithmetic mean.
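The squares and products are tedious by hand and quick by machine. The Python sketch below is an illustration that assumes the class marks and frequencies of Table 5.8; the names `sum_f_x_squared` and `grouped_quadratic_mean` are invented for the example. Carried to two decimals, the machine result comes out near 175.46 cm.

```python
import math

CLASS_MARKS = list(range(156, 199, 3))   # 156, 159, ..., 198
FREQUENCIES = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def sum_f_x_squared(marks, freqs):
    """Total of the f * X**2 column."""
    return sum(f * x * x for x, f in zip(marks, freqs))


def grouped_quadratic_mean(marks, freqs):
    """Square root of the frequency-weighted mean of the squared class marks."""
    return math.sqrt(sum_f_x_squared(marks, freqs) / sum(freqs))


quadratic = grouped_quadratic_mean(CLASS_MARKS, FREQUENCIES)
arithmetic = sum(f * x for x, f in zip(CLASS_MARKS, FREQUENCIES)) / sum(FREQUENCIES)
print(round(quadratic, 2))      # 175.46
print(quadratic > arithmetic)   # True
```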
5.12. Quartiles, Deciles, and Percentiles: Grouped Data.—The
quartiles, deciles, percentiles, and other “averages of position”
can either be approximated by inspection of the ogive, as
explained in Sec. 5.7, or they may be located in an ordinary
frequency table by methods based on the same assumptions as
those used in computing the median. It is necessary merely to
find the (N/4)th item or the (N/10)th item or the (N/100)th
item instead of the (N/2)nd item, which we found in the case of
the median. The similarity of the methods makes it unnecessary
to give an extended description of them here. The methods of
finding the first quartile, the third decile, and the 57th percentile
are given below as illustrations. The work here given, when
studied in conjunction with the description of the location of the
median, is self-explanatory.

Locating the first quartile:

N = 1000

N/4 = 1000/4 = 250

We wish to find the size of the 250th item.

It is in class 6.

It is the 70th item up in the class.

It is 70/146 of the way up in the class.

It is (70/146)(3 cm.) = 1.44 cm. from the actual lower limit of
the class.

Q1 = 169.5 + 1.44 = 170.94 cm.

Round off to give Q1 = 171 cm.

Since the first quartile is the 250th item, the third quartile is the
750th item. It would be necessary to find the size of this item as
we have found the size of the first quartile.

Locating the third decile:

N = 1000

3N/10 = 3(1000)/10 = 300

We wish to find the 300th item.

It is in class 6.

It is 120 items up in the class.

It is 120/146 of the way up in the class.

It is (120/146)(3 cm.) = 2.47 cm. from the actual lower limit of
the class.

D3 = 169.5 + 2.47 = 171.97 cm.

Round off to give D3 = 172 cm.

Locating the 57th percentile:

N = 1000

57N/100 = 57(1000)/100 = 570

We wish to find the value of the 570th item.

It is in class 8.

It is 56 items up in the class.

It is 56/181 of the way up in the class.

It is (56/181)(3 cm.) = 0.93 cm. from the actual lower limit of
the class.

P57 = 175.5 + 0.93 = 176.43 cm.

Round off to get P57 = 176 cm.

Other quartiles, deciles, and percentiles would be computed
similarly. If we interpret the three which we have just computed
(using the rounded numbers), we find that Q1 indicates that 1/4 of
the students are shorter than 171 cm. and 3/4 are taller than 171
cm. The third decile shows that 3/10 of the students are below
172 cm. and 7/10 exceed this height. The 57th percentile shows
that 57 per cent of the students are shorter than 176 cm. and 43
per cent are taller. Similar interpretations would be given to
other such measures.
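Since every average of position is found by the same interpolation, one routine serves for all of them. The Python sketch below is an illustration under the assumptions of this section (equal class intervals, items spread evenly within each class); the function name `grouped_quantile` and its parameters are invented for the example. The position wanted is passed as a numerator and a denominator, so that the 57th percentile is requested as 57 and 100.

```python
FREQUENCIES = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def grouped_quantile(frequencies, lower_limit, class_interval,
                     numerator, denominator):
    """Value of item number (numerator * N / denominator), by interpolation.

    lower_limit is the actual lower limit of the first class.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty frequency table")
    target = numerator * n / denominator
    cumulative = 0
    for k, f in enumerate(frequencies):
        if f > 0 and cumulative + f >= target:
            fraction = (target - cumulative) / f
            return lower_limit + (k + fraction) * class_interval
        cumulative += f
    raise ValueError("position lies beyond the table")


q1 = grouped_quantile(FREQUENCIES, 154.5, 3, 1, 4)
d3 = grouped_quantile(FREQUENCIES, 154.5, 3, 3, 10)
p57 = grouped_quantile(FREQUENCIES, 154.5, 3, 57, 100)
print(round(q1, 2), round(d3, 2), round(p57, 2))   # 170.94 171.97 176.43
```

The results should agree with the hand computations above to within rounding, and calling the routine with 1 and 2 reproduces the median.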

5.13. Summary of Averages with Grouped Data.—Just as we
summarized in Sec. 4.11 the directions for computing the various
averages when each of the original values was given separately, 
so we summarize here the methods for computing these averages 
when the data are presented to us in the form of a frequency table. 

1. The arithmetic mean ($\bar{X}$). Long method:

   Write down the class marks.
   Write beside each class mark the frequency in the class.
   Multiply each class mark by the corresponding frequency.
   Add the products.
   Divide this sum by the number of cases.
   Formula:

   $$\bar{X} = \frac{\sum fX}{N}$$

2. The arithmetic mean ($\bar{X}$). Short method:

   Select as an origin a class mark near the center of the distribution.
   Write down beside each class mark the number of class intervals by
   which it exceeds (+) or falls short of (-) the origin so selected.
   Multiply each frequency by the number written beside it.
   Add the products so obtained.
   Divide the sum just obtained by the number of cases.
   Multiply the quotient by the class interval.
   Add the result algebraically to the value of the class mark chosen in
   the first step.
   Formula:

   $$\bar{X} = X' + C_i\left(\frac{\sum fd}{N}\right)$$

3. The median (Med.):

   Compute N/2 to find the location of the desired item.
   Add the class frequencies to discover which class contains the median
   item.
   Find how many items one must count into this class to reach the
   median item.
   Divide this number by the number of items in the class which contains
   the median.
   Multiply the decimal so obtained by the class interval.
   Add the product to the actual lower limit of the class which contains
   the median.

4. The mode (Mo.). First approximate method:

   Multiply the median by 3.
   Multiply the arithmetic mean by 2.
   Subtract the latter product from the former.
   Formula:

   $$\text{Mo.} = 3\,\text{Med.} - 2\bar{X}$$

5. The mode (Mo.). Second approximate method:

   Subtract the smallest class mark in the frequency table from the
   value of the arithmetic mean.
   Multiply the remainder by twice the number of classes in the
   frequency table.
   Multiply the class interval by one less than the number of classes in
   the frequency table.
   Subtract the product just obtained from the product obtained in the
   step next preceding.
   Divide half the difference just obtained by a number which is one
   less than the number of classes in the table.
   Add the quotient to the smallest class mark in the table.
   Formula:

   $$\text{Mo.} = m + \frac{2g(\bar{X} - m) - C_i(g - 1)}{2(g - 1)}$$

6. The geometric mean ($M_g$):

   Write down beside each frequency the logarithm of the corresponding
   class mark.
   Multiply each logarithm by the corresponding frequency.
   Add the products.
   Divide the sum by the number of cases.
   This yields the logarithm of the geometric mean. To find the
   geometric mean, take the antilog.
   Formula:

   $$\log M_g = \frac{\sum f(\log X)}{N}$$

7. The harmonic mean ($M_h$):

   Divide each frequency by the corresponding class mark.
   Add the quotients.
   Divide the number of cases by the sum just obtained.
   Formula:

   $$M_h = \frac{N}{\sum \left(\dfrac{f}{X}\right)}$$

8. The quadratic mean ($M_q$):

   Square each class mark.
   Multiply each square by the corresponding frequency.
   Add the products just obtained.
   Divide this sum by the number of cases.
   Take the square root of the quotient.
   Formula:

   $$M_q = \sqrt{\frac{\sum f(X^2)}{N}}$$

9. The quartiles ($Q_1$, $Q_3$):

   By adding frequencies in classes find which class contains the item
   that is number N/4 from each end.
   Find how many cases one must go into the class containing the
   quartile.
   Interpolate within the class as for the median.

10. The deciles, percentiles, etc. ($D_1$, $D_2$, etc.; $P_1$, $P_2$, etc.):

   Proceed as with the median except that the item wanted is the
   number N/10, 2N/10, N/100, or 2N/100 instead of N/2.


5.14. Characteristics of a Good Average.—Throughout this
chapter and the preceding one, we have been considering the
detail involved in the methods of computing various averages.
It is now time to consider the characteristics of these averages
and their advantages and disadvantages. Logically it might 
seem desirable to have done this first, before we considered the 
details of computation; but pedagogically it is much easier to 
discuss the abstractions of advantage and disadvantage after 
the student has seen the thing discussed than before. 

We have seen that an average is a single value selected from 
a group of values to represent them in some way—a value which 
is supposed to stand for the whole group of which it is a part, as 
typical of all the values in the group. If we were to enumerate
the qualities that we should desire in such a typical value if we 
could get a perfect one, it is likely that we should list at least the 
following seven characteristics: 

1. The number should be unequivocally defined, so that there 
can be no question, in any given distribution, as to just what the 
value is. It is important that the average be objective—possibly 
defined by an algebraic formula—so that if 10 different students 
all work with the same figures they will all (barring arithmetical 
mistakes) get the same answer. The average should not depend 
on the whim, caprice, or idiosyncrasy of the computer. 

2. The average should be inherently descriptive of the data 
in such a way that its meaning is easily understood. It should 
not be such a distant mathematical abstraction that it can be 
comprehended only by the advanced student. Statistical 
methods exist to simplify data, not to make them more complex. 

3. The average should, if possible, be easy to compute. This, 
however, is not so important as the ease of understanding. The 
statistician often performs difficult or tedious processes himself 
in order to get results that are easily understandable; and ease 
of computation, while desirable when other things are equal, is 
not to be sought at the expense of other advantages. 

4. The average should depend on every single item in the 
group, so that if we alter the value of any member of the group we 
shall alter the value of the average. The average is to be thought 
of as typifying all the members of the group, not merely some 
of them. 

5. Although every item should influence the value of the 
average, no item or items should influence it unduly. We should 
not want one or two extremely large or extremely small items 
to overshadow all the rest. We should prefer the items which
make up the group to have approximately equal influence on the
average.

6. We should like to get some value which has what the statistician
calls “sampling stability.” This means that if we pick
half a dozen different groups of college students, and compute the 
average of each group, we should prefer to get approximately 
the same value each time. We do not want our answer to depend 
too much on the particular 1000 students that we have studied, 
but we should like a value that is dependable—that will be about 
the same in one sample as in another. We know that there is a 
considerable difference in practice among the various averages 
that we have studied in this regard. Also we should prefer to 
get about the same answer whether we group with class intervals 
of 5 or 10, or whether we set different class marks with the same 
class interval. Minor variations in grouping of the items should 
not affect the average materially. 

7. Finally, we should prefer to have an average that can be 
easily used in further statistical computation. For example, if 
we have computed an average for freshmen, one for sophomores, 
one for juniors, and one for seniors, we should like to be able to 
combine them to get an average for the entire undergraduate 
body. 

5.15. Relationships between the Averages.—Before we take 
up the various averages for individual discussion and comparison 
with the criteria listed in the preceding section, it should be 
pointed out that certain relationships usually obtain among the 
different averages. The statistician is often almost as much 
interested in these relationships among two or more of the aver¬ 
ages as he is in the averages themselves. 

1. If the distribution is symmetrical, the values of the arith¬ 
metic mean, the median, and the mode will be identical, and if 
the distribution is nearly, but not quite, symmetrical their values 
will be almost identical. In other words, the similarity or diver¬ 
gence in the sizes of these three measures (or any two of them) is 
to some extent an indication of the symmetry of the distribution. 

2. As we have already seen (Sec. 5.8), if the distribution is 
mound-shaped and only moderately asymmetrical, the median 
lies between the arithmetic mean and the mode, being approxi¬ 
mately twice as far from the latter as from the former. 

3. In any distribution where the original items differ at all in 
size, the following averages will all differ in size and their values 
will fall in the following order: 

$$M_q > \bar{X} > M_g > M_h$$

In the limiting case where all the original items are identical in 
size (in which case we would hardly compute an “average”), 
these four averages would all be equal. 
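
The ordering of the quadratic, arithmetic, geometric, and harmonic means can be checked numerically. What follows is only a minimal Python sketch, not part of the original text: the function names are illustrative, the items 5, 8, and 14 are borrowed from the example used in the next section, and all items are assumed to be positive so that the geometric and harmonic means exist.

```python
import math

def quadratic_mean(items):
    # Square root of the mean of the squared items.
    return math.sqrt(sum(x * x for x in items) / len(items))

def arithmetic_mean(items):
    return sum(items) / len(items)

def geometric_mean(items):
    # nth root of the product, computed through logarithms (positive items only).
    return math.exp(sum(math.log(x) for x in items) / len(items))

def harmonic_mean(items):
    # Reciprocal of the arithmetic mean of the reciprocals (nonzero items only).
    return len(items) / sum(1 / x for x in items)

items = [5, 8, 14]
means = [
    quadratic_mean(items),
    arithmetic_mean(items),
    geometric_mean(items),
    harmonic_mean(items),
]
print(means)  # the four values appear in descending order

# Limiting case: identical items give four (nearly) equal averages.
identical = [7, 7, 7]
limiting = [f(identical) for f in
            (quadratic_mean, arithmetic_mean, geometric_mean, harmonic_mean)]
print(limiting)
```

The limiting case is compared only approximately in practice, since the logarithmic route to the geometric mean introduces small rounding differences.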

5.16. Advantages and Disadvantages of the Arithmetic Mean. 
The arithmetic mean is certainly the most widely used and most 
commonly understood of all the averages. It is a value so 
selected out of a group that if all members of the group were 
uniform in size, and if they retained their actual total size, they 
would each be equal to the arithmetic mean. Or we can think 
of the arithmetic mean of N items as a single item made up of 
1/Nth part of each of the original items. If we say that the 
average income of 40 people is $134.50 per month, we mean that 
if each of the 40 people contributed 1/40th of his income to a 
common fund, this common fund would amount to $134.50 per 
month. Thus we see that each value in the distribution plays 
a part in determining the arithmetic mean, and a change in any 
item will change the arithmetic mean. It is rigidly defined by a 
mathematical equation and is easy to compute. Sometimes 
we can compute it when we cannot compute other averages and 
when the values of the individual items are not known. For 
example, if we know that the total consumption of milk in the 
United States amounts to 51,100,000,000 quarts and that the 
population is 132,000,000, we do not need to know the facts for 
any individual family to compute an average consumption of 
387 quarts per capita. We could not compute a single one of 
the other averages from these data. It is also true, as we shall 
discover later (see Chap. IX), that the arithmetic mean is unusu¬ 
ally stable in sampling, running more uniform from sample to 
sample than any of the other averages. For scientific work this 
peculiarity of the arithmetic mean is of great importance. 

On page 89 we noted that the sum of the deviations of a group 
of individual items from their arithmetic mean was equal to zero. 
It is also true that the sum of the squares of these deviations 
is smaller when taken from the arithmetic mean than when taken 
from any other number. Let us take, for illustration, the num¬ 
bers 5, 8, and 14. If we compare these numbers with the number 
10, we find that they differ from 10 by -5, -2, and 4. These 
differences, when squared, give 25, 4, and 16; and the sum of the 
squares is 45. Had we compared the three numbers with their 
arithmetic mean 9, we would have found the differences to be 
-4, -1, and 5. The squares of these differences are 16, 1, and 
25; and the sum of the squares is 42, which is smaller than the 
sum of the squares when the deviations were measured from 10. 
The student may try numbers other than 9 or 10, and he will 
find that the sum is smaller when deviations are taken from 9, 
the arithmetic mean, than when taken from any other number 
he can select. 1 This means that the arithmetic mean of a group 
of numbers is “fitted by least squares,” an expression that we 
shall use later on in considering certain important theorems of 
probability. We shall discover then that a value which is fitted 
by least squares has more chance of being correct than any other 
value if the distribution is what we shall then call “normal.” 
For our present purposes, we need merely point out that no 
measurement is ever made with complete exactitude (see Chap. 
II) and that when we measure anything over and over again we 
get different answers; so we can never be sure what measurement 
is absolutely correct. But if we measure the speed of light, the 
length of a line, or the weight of a cubic foot of water over and 
over again, getting slightly different measurements each time, 

1 The student of the calculus will be able to prove this fact easily for all 
cases, rather than relying on experiment. Suppose we are to choose any 
value M from which to measure the deviations d. Then for each value of 
our original series X, we have: 

$$X + d = M$$

$$d = M - X$$

$$d^2 = (M - X)^2 = M^2 - 2MX + X^2$$

$$\sum d^2 = \sum (M^2 - 2MX + X^2) = NM^2 - 2M\sum X + \sum X^2$$

This is the sum of the squares of the deviations, which we wish to minimize. 
It will have its minimum value when the first differential is equal to zero. 
If we represent this function by f, we have 

$$\frac{df}{dM} = 2NM - 2\sum X, \text{ which we set equal to zero}$$

$$NM = \sum X$$

$$M = \frac{\sum X}{N}$$

In other words, we shall get the smallest possible sum of the squared deviations 
when our value of M is $\sum X/N$, that is, when it is equal to the arithmetic 
mean. 
the arithmetic average of these measurements has a better chance 
of being the actual speed of light, length of the line, or weight 
of the water than any other figure which we can take. This 
“least-squares” property of the arithmetic mean, though often 
overlooked, is one of its most important characteristics to the 
scientist. 
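
The least-squares property can also be tried by experiment, as the text invites the student to do. The following Python sketch is an illustration only; the helper name `sum_of_squares` and the grid of trial values (5.0 to 14.0 in steps of 0.1) are assumptions made for the demonstration.

```python
def sum_of_squares(items, origin):
    # Sum of squared deviations of the items from a chosen origin.
    return sum((x - origin) ** 2 for x in items)

items = [5, 8, 14]
mean = sum(items) / len(items)

from_ten = sum_of_squares(items, 10)     # 25 + 4 + 16
from_mean = sum_of_squares(items, mean)  # 16 + 1 + 25

# Try many other origins and keep the one giving the smallest sum.
candidates = [c / 10 for c in range(50, 141)]
best = min(candidates, key=lambda c: sum_of_squares(items, c))

print(from_ten, from_mean, best)
```

Among the trial values the smallest sum is found at the arithmetic mean, in agreement with the proof by calculus given in the footnote.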

As a final advantage of the arithmetic mean, we can point out 
that it is unusually adaptable when we wish to carry on further 
mathematical computations with it. Suppose we have three 
basketball teams made up of five men each. The average weight 
of the members of Team A is 145 lb., that of Team B is 158 lb., 
and that of Team C is 162 lb. We can combine these averages 
directly to find the average weight of the 15 players. Since 
each average is based on 5 men, we merely take the arithmetic 
mean of the three means, to get an average of 155 lb. for the 15 
men. But we can still compute this mean for the entire group 
from the means of the subgroups even if the subgroups do not 
contain equal numbers of cases. If we call our subgroups 1, 2, 
3, . . . , n, and represent the totals of the values in the individual 
subgroups as $\sum X_1, \sum X_2, \sum X_3, \ldots, \sum X_n$, it is evident 
that the grand total of all the values in all the groups thrown 
together is the sum of the totals in the subgroups, or, in the form 
of an equation, if we let $\sum X$ represent the sum of all the items 
in all the groups together, we get 

$$\sum X = \sum X_1 + \sum X_2 + \sum X_3 + \cdots + \sum X_n$$

But 

$$\sum X = N\bar{X}, \quad \sum X_1 = N_1\bar{X}_1, \text{ etc.}$$

Therefore 

$$N\bar{X} = N_1\bar{X}_1 + N_2\bar{X}_2 + N_3\bar{X}_3 + \cdots + N_n\bar{X}_n$$

The desired value of $\bar{X}$ must be 

$$\bar{X} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2 + N_3\bar{X}_3 + \cdots + N_n\bar{X}_n}{N}$$

Thus all we need to do is to take a weighted arithmetic mean of 
the averages, using as weights the numbers of cases in the indi¬ 
vidual subgroups. We multiply each subgroup average by the 
number of cases in that subgroup, add the products, and divide 
the sum by the grand total number of cases in all groups together. 
This gives us the average for all groups together. No such computations 
can be carried out for the median or the mode. If we 
know the medians or the modes of several subgroups, we cannot 
find the median or the mode of the whole lot together. 
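
The weighted combination of subgroup means described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed method; the function name `combined_mean` is hypothetical, and the unequal group sizes in the second call are invented purely to show the general case.

```python
def combined_mean(means, counts):
    # Weighted arithmetic mean of subgroup means,
    # using the number of cases in each subgroup as weights.
    grand_total = sum(m * n for m, n in zip(means, counts))
    return grand_total / sum(counts)

# The three basketball teams of five men each.
teams = combined_mean([145, 158, 162], [5, 5, 5])
print(teams)

# Hypothetical unequal subgroup sizes (4, 5, and 6 men).
unequal = combined_mean([145, 158, 162], [4, 5, 6])
print(unequal)
```

With equal group sizes the result reduces to the simple mean of the three means; with unequal sizes the larger groups pull the combined mean toward their own averages.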

The arithmetic mean thus has so many advantages that it is 
used far more than all the other averages combined. We might 
almost say that in case of doubt the arithmetic mean should be 
used, and that other averages should be used only when there is 
some clear reason for it. Yet the arithmetic mean does have 
one or two distinct disadvantages. In the first place, it is very 
sensitive to extremely large or extremely small items (especially 
so to large ones). The chance inclusion of such an extreme item 
in the group being studied may give us an arithmetic average 
which is not really typical of the group. Let us consider the 
numbers 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 865. The median of these 
numbers is 8, and their mode is 8, but their arithmetic mean is 
85.7. The latter figure does not depict well either the one large 
number or the 10 small ones. The inclusion of the single very 
large number at the upper extreme has thrown our arithmetic 
mean to a point far from any of the actual items in the group. 
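
The effect of the single extreme item can be reproduced with Python's standard `statistics` module. The sketch below is only an illustration of the example just given; the comparison without the extreme item is added for contrast and is not in the original.

```python
import statistics

items = [6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 865]

mean = statistics.mean(items)      # 943 / 11, about 85.7
median = statistics.median(items)  # middle item of the array
mode = statistics.mode(items)      # most frequent item

# For contrast: the same averages without the extreme item.
trimmed = items[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(mean, median, mode)
print(trimmed_mean, trimmed_median)
```

Dropping the one large item moves the arithmetic mean from about 85.7 to 7.8, while the median stays at 8.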

Where there is marked skewness in a distribution, so that the 
arithmetic mean, the median, and the mode differ widely in 
value, one should always consider the possibility that the arith¬ 
metic mean is not a truly representative or typical value, and 
that the median or the mode should be used in preference to it. 

In the United States the distribution of incomes is decidedly 
asymmetrical, as will be seen from the accompanying chart. 1 
If it is said that the arithmetic average income for 1918 was 
$1690 (which is a rough average of the figures on which the chart 
is based), the reader is likely to be misled. Between 70 and 75 
per cent of the income earners of 1918 earned less than $1690. 
In other words, $1690 was not only the mean income but one 
of the higher incomes. The median income was roughly $1170, 
and the modal income was presumably even lower. 2 For many 

1 Based on figures from Warren C. Waite, “Economics of Consumption,” 
p. 22, McGraw-Hill Book Company, Inc., New York, 1928. Waite quotes 
them from the National Bureau of Economic Research publication “Income 
in the United States,” Vol. I. The figures are for 1918, but there is good 
reason to believe that the asymmetry still exists. 

2 If we were to compute the mode from the mean and median in accordance 
with the formula on p. 99, we should find that the modal income was 
purposes one is likely to be more interested in the modal income 
in the United States than in the mean income. A few million¬ 
aires raise the mean income tremendously without raising the 
typical plane of living particularly. For this reason, then, we 
may also prefer to use some average other than the arithmetic 



Fig. 5.3. An approximate distribution of money incomes in the United 
States in 1918. The curve continues toward the right to incomes of millions 
of dollars. 

one. Similarly we realize that the arithmetic mean of a U-shaped 
distribution would fall at a point where values were uncommon, 
and that in such peculiar distributions the arithmetic mean would 
give a misleading idea of the distribution. 

5.17. Advantages and Disadvantages of the Median.—The 
median is rigidly defined (if there is any median at all), and the 
concept involved is readily understood by anyone even though 
the term itself may be unfamiliar. If the data are in an array or 
a frequency table, the median is easy to compute, and items of 
extreme size have almost no influence on it. It has less sampling 
stability than the arithmetic mean. If we had 10 groups of 
newborn babies and found the median weight in each group, 
we should discover that these medians not only differed in size, 
but their variation was about a quarter again as large as the 
variation in the sizes of the arithmetic means of the same ten 
samples. The median has the advantage that we can compute it 

$130. This is obviously altogether too low. It is to be remembered that 
the formula just mentioned is to be applied only in those cases where the 
asymmetry is moderate. In this case the asymmetry is great. 
even from a frequency table with open-end classes, since we do 
not need to know the sizes of the extreme items as long as we 
know that they are extreme items. We can even find the median 
in cases which are nonmathematical in character, and where 
numerical measurement is impossible. For example, we might 
arrange a number of pieces of blue cloth in order of the intensity 
of their color. The color of the piece in the middle will then be 
the median color. Thus the median can be used with data which 
are nonmathematical in character. A characteristic of the 
median which is also sometimes useful is that the sum of the 
absolute deviations (disregarding plus and minus signs) is smaller 
when measured from the median than when measured from any 
other value. We showed in the preceding section that the 
algebraic sum (keeping track of signs) of these deviations was 
zero when measured from the arithmetic mean. If we go 
back to the example which we then used, taking the numbers 
5, 8, and 14, the median is 8. The absolute deviations are 3, 0, 
and 6, giving a sum of 9. Had we chosen the arithmetic mean, 
which is 9, the absolute deviations would have been 4, 1, and 5, 
giving a sum of 10. If we chose still other numbers than 8 or 9 
from which to measure the absolute deviations, we should find 
that the sum of these absolute deviations was smaller when 
measured from 8 than from any other number. 1 
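
The same experiment can be run for absolute deviations. The Python sketch below is illustrative only; the helper name `sum_abs_dev` and the grid of trial values from 5.0 to 14.0 are assumptions made for the demonstration.

```python
import statistics

def sum_abs_dev(items, origin):
    # Sum of absolute deviations (signs disregarded) from a chosen origin.
    return sum(abs(x - origin) for x in items)

items = [5, 8, 14]
median = statistics.median(items)
mean = statistics.mean(items)

from_median = sum_abs_dev(items, median)  # 3 + 0 + 6
from_mean = sum_abs_dev(items, mean)      # 4 + 1 + 5

# Search a grid of other origins for the smallest sum.
candidates = [c / 10 for c in range(50, 141)]
best = min(candidates, key=lambda c: sum_abs_dev(items, c))

print(from_median, from_mean, best)
```

With an odd number of items the minimum falls at a single value, the median. With an even number of items any value between the two middle items would give the same smallest sum.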

The median has the disadvantages that it is not quite so well 
known as the arithmetic mean (although easily explained), and 
that it is necessary to arrange the items in an array before it can 
be computed. Sometimes there is no median in a discrete 
series, as was illustrated at the end of Sec. 5.6. Also the median 
is not adapted to further arithmetical work. As we have seen, 
if we are told the medians of each of several subgroups, there is 


1 The student can readily see that this must be true if he considers the 
fact that the sum of the distances from two given points to any third point 
is smallest when the third point lies between them and is always the same 
no matter where the third point lies on the line between them. The median 
has as many points on one side of it as on the other, so the points (or values 
in a distribution) can be taken in pairs, one each side of the median, and for 
each pair the sum of the distances will be less than for any point which does 
not lie between them. Therefore the sums of the distances for all such pairs 
will be smaller when measured around the value in the center than it will 
be for any point not so located that equal numbers of points lie below and 
above it. 
no way of finding the median of the group as a whole. Moreover 
the median is not sensitive to changes in the values of the items 
that make up the distribution. We can change the sizes of items 
without influencing the value of the median at all as long as we 
do not change the size of any item enough to move it to the oppo¬ 
site side of the median. Each item does have a minor and 
tenuous influence on the size of the median merely by means of 
the fact that it is larger or smaller than the median, but aside 
from that the size is immaterial. It is an advantage, as we have 
seen, for an average not to be overly sensitive, but many workers 
feel that the median is not sensitive enough. 

5.18. Advantages and Disadvantages of the Mode.— Any 
average is a single value taken to represent a whole group of 
values. There would be no justification in selecting any such 
representative value if the items in the original group were not 
concentrated or clustered about some point. The fact that 
the mode indicates this point of heaviest concentration makes 
it in its abstract aspects perhaps the best average of all. We 
have already seen, in considering an illustrative problem on 
distribution of incomes (Sec. 5.16), that when a distribution is 
badly skewed or non-normal we are more likely to be interested 
in the mode than in the arithmetic mean. In fact, some authori¬ 
ties have suggested that if the mode and the arithmetic mean are 
significantly different in value, one should use the mode. This is 
perhaps going too far, yet we must realize that the mode is, as 
we suggested in Sec. 5.14, “inherently descriptive of the data in 
such a way that its meaning is easily understood.” It is the 
concept in which the layman is perhaps most often interested, 
even though he may not be familiar with the name. Moreover 
the mode is hardly at all influenced by the values of extreme 
items. 

On the other hand, the mode is difficult to compute, and the 
rough approximations to it which we have so far discussed are 
unreliable and peculiarly subject to instability of sampling. It 
is possible to make radical changes in the sizes of items in a distri¬ 
bution without changing the value of the mode at all. While 
it can be said that the mode depends on the values of all the items 
to the extent that the mode would have been different if enough 
of the items had been different, this is almost the same as saying 
that some of the items have little or no effect on the value of the 
mode. A distribution may have no mode, and usually there 
will be no well-defined mode unless the number of cases is large. 
Or there may be two or more modes, although in such cases the 
statistician usually investigates to see if his data are really homo¬ 
geneous. For example, when studying a distribution of wages 
one might find two high points on his frequency polygon. He 
would then ask himself whether this might be because he had 
lumped together wages of men and of women, or wages of skilled 
and of unskilled workers, or wages of organized and unorganized 
workers. And finally, the mode cannot be easily used in further 
algebraic processes. If we have the modes of several distribu¬ 
tions, we cannot combine them to get a new mode of the joint 
distribution. It is unfortunate that an average which has such 
an intellectual appeal as the mode happens to be so difficult to 
compute and so unreliable after it is computed. 

5.19. Advantages and Disadvantages of the Geometric Mean. 
The geometric mean is unusual enough so that it is quite natural 
for the student to ask why anyone ever bothers to compute it. 
It is obvious that we can multiply 20 numbers together and take 
the 20th root of the product (or perform the equivalent computa¬ 
tion by means of logarithms), but why should we want to do it? 
The answer is that in certain sorts of problems this is the only 
way to get the right answer. Just as, in the case of the weighted 
arithmetic mean, we saw that one weights his average in order 
to get the right answer, and not because he likes the complica¬ 
tions of the method, so in the case of the geometric mean, the 
method is used in spite of its complications and not because of 
them. 

We can start our discussion of the geometric mean by pointing 
out that it meets certain of the requirements that we listed in 
Sec. 5.14. It is rigidly defined by a mathematical formula, so 
that the result does not depend in the slightest on the whim 
of the worker who computes it. It depends on the value of 
every item in the distribution; no single item can be changed in 
the least without affecting the value of the geometric mean. 
Its value is not quite so greatly influenced by extreme items as 
are the values of the quadratic, arithmetic, and harmonic means. 
And the result can be used in further statistical work. The 
geometric means of samples can be combined to get the geometric 
mean of the whole. 
In addition to these advantages, however, there are two sorts 
of cases in which the use of the geometric mean is particularly 
indicated. First, we have those cases in which we are finding 
the average of values which are in geometric progression. For 
example, we might take the progression which runs, 1, 2, 4, 8, 16, 
etc. Suppose we wish to find the average of these five numbers. 

Fig. 5.4. Arithmetic and geometric means of the numbers 1, 2, 4, 8, and 16. 

If we add them and divide by five, we find their arithmetic mean, 
6.2. If, on the other hand, we multiply them and take the fifth 
root, we get the geometric mean, 4. We note that 4 is actually 
the number which appears in the middle of the distribution, 
while 6.2 is not. Moreover, if we make a graph of the data, as in 
Fig. 5.4, we find that the value 4 really falls on the line, while the 
value 6.2 does not. 
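
The two averages of this progression can be computed directly. The following Python sketch is an illustration under the assumption that all items are positive; the function name `geometric_mean` is illustrative, and the logarithmic route mirrors the hand method mentioned in the text.

```python
import math

def geometric_mean(items):
    # nth root of the product, computed through logarithms (positive items only).
    return math.exp(sum(math.log(x) for x in items) / len(items))

series = [1, 2, 4, 8, 16]

arithmetic = sum(series) / len(series)
geometric = geometric_mean(series)
middle_term = series[len(series) // 2]

print(arithmetic, round(geometric, 6), middle_term)
```

Because of floating-point rounding, the geometric mean is best compared with the middle term using a tolerance (for example `math.isclose`) rather than exact equality.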

We have a geometric series whenever the quotient found by 
dividing any term by the term following is constant throughout 
the series. In the series above, 1, 2, 4, 8, etc., if we divide any 
term by the term following we get the quotient 1/2. We do not 
insist, of course, that the quotients be exactly equal, as long as 
they are approximately so. The population of the United States 
in the first eight censuses was (in millions) as follows: 


Year    Population      Year    Population
1790        3.9         1830       12.9
1800        5.3         1840       17.1
1810        7.2         1850       23.2
1820        9.6         1860       31.4

If we divide each of these numbers by the one following, we get 
the following seven quotients: 0.74, 0.74, 0.75, 0.74, 0.75, 0.74, 
and 0.74. The approach to uniformity is startling. The series 
seems to be geometric. If we were to compute the average for 
this series we should take, not the arithmetic mean of 13.8 million, 
but the geometric mean 11.1. It will be noted that the arith¬ 
metic mean gives a value even higher than the population of 
1830, which is well beyond the middle of the period; while the 
geometric mean gives a value which not only lies between the 
populations of 1820 and 1830, but is almost exactly the geometric 
mean of them. 
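
The quotients and the two averages for the census figures can be recomputed as a check. The Python sketch below is illustrative only; the variable names are assumptions, and the data are the eight census populations (in millions) quoted above.

```python
import math

# Population of the United States, in millions, first eight censuses.
population = {
    1790: 3.9, 1800: 5.3, 1810: 7.2, 1820: 9.6,
    1830: 12.9, 1840: 17.1, 1850: 23.2, 1860: 31.4,
}
values = list(population.values())

# Each census divided by the one following it.
quotients = [round(a / b, 2) for a, b in zip(values, values[1:])]

arithmetic = sum(values) / len(values)
geometric = math.exp(sum(math.log(v) for v in values) / len(values))

# Geometric mean of the two middle censuses, 1820 and 1830.
middle_pair = math.sqrt(population[1820] * population[1830])

print(quotients)
print(round(arithmetic, 1), round(geometric, 1), round(middle_pair, 1))
```

The near-constant quotients are what justify treating the series as geometric; the rounded means agree with the 13.8 and 11.1 million quoted in the text.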

As another example, if, during a 10-year period, a sum of 
money at interest grew from $100 to $500, how large was the sum 
at the middle of the period? The natural inclination is to take 
the arithmetic mean of $100 and $500, or $300. But if the sum 
increased to three times the original amount in half the period, 
it should increase to three times three, or nine times the original 
amount, in the whole period. We must take the geometric mean 
of the numbers 100 and 500, which is $223.60. If the original 
sum was multiplied by 2.236 in the first half, it must have been 
multiplied by 2.236 twice, or by $2.236^2$, or by 5, in the whole 
period. 
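
A short check of the midpoint calculation, written as a Python sketch. It assumes, as the text does, that the sum grows by the same factor in each half of the period; the variable names are illustrative.

```python
import math

start, end = 100.0, 500.0

# Value at the middle of the period: geometric mean of the two end values.
midpoint = math.sqrt(start * end)

# Growth factor for each half of the period.
half_factor = midpoint / start

print(round(midpoint, 2), round(half_factor, 3), round(half_factor ** 2, 6))
```

Applying the half-period factor twice recovers the fivefold growth over the whole period, which the arithmetic midpoint of $300 would not.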

Experience shows that many sorts of phenomena in many 
different sciences tend to grow geometrically. In such cases it 
is evident that the geometric mean will give the correct answer, 
while the arithmetic mean will not. We use the geometric mean, 
not because we like to play with logarithms, nor because it makes 
us appear sophisticated, but because it is accurate for these data. 
For other data it would be misleading. 

The student might try proving for himself that if the popula¬ 
tion of a city increases 30 per cent in 10 years it does not increase 
3 per cent each year. If it did, and we started with a population 
of X persons at the beginning, at the end of one year there would 
be 1.03X, at the end of the second year there would be $1.03^2 X$, 
and at the end of the 10th year there would be $1.03^{10} X$, or 1.34X, 
and the population would have increased 34 per cent rather than 
30 per cent. This problem calls for another application of the 
geometric mean, using a formula for geometric progressions 
which is well known to all students of mathematics as applied to 
finance. Let us represent the amount at the beginning of our 
period by B, and the amount at the end by E, while we represent 
the rate of increase per unit of time by r. Let n represent the 
number of units of time in our entire period. Then our formula 
becomes 

$$r = \sqrt[n]{\frac{E}{B}} - 1$$

Applied to the problem which we have just attempted, the population 
at the beginning is X and at the end is 1.30X. The length 
of period is 10 years. Substituting in our formula, we get 

$$r = \sqrt[10]{\frac{1.30X}{X}} - 1 = \sqrt[10]{1.30} - 1$$

Using logarithms, we find that this yields 

$$r = 0.027$$

Instead of an average rate of 3 per cent each year, it turns out 
that we had an average rate of increase of 2.7 per cent each year. 
Using the arithmetic mean gives us the wrong answer; using the 
geometric mean gives us the right answer. If the student will 
start with any number and multiply it by 1.027 ten times, he will 
find that he actually ends with a number 30 per cent larger than 
the one with which he started, proving the correctness of the 
geometric method. 
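
The rate calculation can be repeated without logarithm tables. The Python sketch below is an illustration only; the function name `average_rate` and its argument names are assumptions standing in for the book's B, E, and n.

```python
def average_rate(beginning, end, periods):
    # Average rate of increase per period: nth root of (end / beginning), minus 1.
    return (end / beginning) ** (1 / periods) - 1

# A population that grows 30 per cent in 10 years.
rate = average_rate(1.0, 1.30, 10)

# What the naive 3 per cent per year would actually produce in 10 years.
naive_growth = 1.03 ** 10

# Compounding the computed rate for 10 years recovers the 30 per cent.
recovered = (1 + rate) ** 10

print(round(rate, 3), round(naive_growth, 2), round(recovered, 2))
```

The beginning value is taken as 1.0 here because only the ratio of end to beginning matters; any starting population X gives the same rate.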

In addition to these cases of geometric rates of increase or 
decrease, in which the geometric mean must be used, we shall 
see in Chap. XIV that there are certain theoretical advantages 
in using the geometric mean when computing index numbers. 

Yet the geometric mean has serious disadvantages. Foremost, 
perhaps, is the fact that most people do not understand the 
results and are afraid of the method. While it is not really very 
difficult to compute, it seems so to the student who has a fear of 
logarithms. The statistician should use it, of course, in those 
cases to which it is adapted; but even here it would probably be 
better if he called his answer merely “the average” rather than 
“the geometric average.” If any value in the original series 
is zero, the geometric mean assumes a value of zero regardless 
of the sizes of the other items. If any value in the original series 
is negative, the geometric mean may be either negative or imagi¬ 
nary. In cases where the number of items in the series is even, 
there are always theoretically two possible values of the geometric 
mean, one positive and one negative. For example, the square 
root of 16 is either +4 or -4. In such cases, however, we always 
take the positive value as the geometric mean. 

Some authors suggest that when a distribution has considerable 
positive skewness (as defined in Sec. 8.6), or when there is a 
definite lower limit to the values and no definite upper limit 
(as in a distribution of the numbers of families with various 
numbers of children, where zero is a lower limit and there is no 
upper limit) the geometric mean should be used in preference to 
the arithmetic mean. Other authors suggest that if the frequency 
distribution with logarithmic frequency classes (see Sec. 3.20) 
is more symmetrical than the corresponding distribution with 
equal class intervals the geometric mean should be used in prefer¬ 
ence to the harmonic mean. We can summarize, however, with 
the statement that the usual places to use the geometric average 
are cases involving average rates of increase or decrease, or cases 
involving the computation of index numbers. 

5.20. Advantages and Disadvantages of the Harmonic Mean.— 
The computation of the harmonic mean is somewhat more cum¬ 
bersome than that of the arithmetic mean, and, as with the 
geometric mean, we need some explanation of why anyone would 
want to compute such an average at all. Again the answer is 
that for certain rather unusual types of problems the harmonic 
mean is correct and other averages are incorrect. The har¬ 
monic mean is ordinarily used only in averaging certain kinds 
of rates, and even then only under certain conditions. Let us 
take an example. 

Mr. Sedgewick drives his car at the rate of 25 miles per hour, 
while Mr. Kinsey drives his car at the rate of 50 miles per hour. 
What is their average speed? The answer obviously depends 
on whether they drive for the same distance or for the same time. 
If they both drive the same distance, say 50 miles, Sedgewick 
takes 2 hr. and Kinsey takes 1 hr., or they take a total of 3 hr. 
for the 100 miles. This is an average of 33 1/3 miles per hour. 
If they both drive for the same time, say 1 hr., Sedgewick drives 
25 miles and Kinsey drives 50 miles, or they take 2 hr. for 75 
miles, or 37.5 miles per hour. It will be noted that in the one 
case the average speed was 33 1/3 and in the other case 37.5 miles 
per hour. Both answers are correct. The only difference is in 
whether we kept the time or the distance constant for the two 
men. We should note, though, that one of the cases (the second) 
is the arithmetic mean of the two rates; while the other case (the 
first one) is the harmonic mean of the two rates. Our original 
figures were given as 25 and 50 miles per hour. In these original 
figures the time (hours) was constant and the distance (miles) 
varied. If we want the average with both men driving for the 
same time, we take the arithmetic mean of the rates. If we want 
the average with both men driving for the same distance, we 
take the harmonic mean of the two rates. 
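
Both answers can be verified from totals. The following Python sketch is an illustration only; the function names are hypothetical, and the 50-mile distance and 1-hour time are the figures used in the example above.

```python
def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    # Reciprocal of the arithmetic mean of the reciprocals.
    return len(rates) / sum(1 / r for r in rates)

speeds = [25, 50]  # miles per hour

# Same distance for both drivers: total miles divided by total hours.
distance = 50
same_distance = (distance * len(speeds)) / sum(distance / s for s in speeds)

# Same time for both drivers: total miles divided by total hours.
hours = 1
same_time = sum(s * hours for s in speeds) / (hours * len(speeds))

print(same_distance, harmonic_mean(speeds))
print(same_time, arithmetic_mean(speeds))
```

Holding the distance fixed reproduces the harmonic mean, and holding the time fixed reproduces the arithmetic mean, whatever common distance or time is chosen.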

We can generalize this rule by noticing that in every rate there 
is a variable and a constant term. For example, in miles per 
hour the hour is constant and the miles vary; in dollars per dozen 
the dozen is constant and the dollars vary; in the output per man 
the man is constant and the output varies. If we want to take an 
average of such rates, we must decide whether it is desired to 
keep constant in our average the factor that was constant in the 
rate (in which case we use the arithmetic mean of the rates), 
or the factor that was variable in the rate (in which case we 
use the harmonic mean). Two further examples follow as 
illustrations: 

1. Suppose we buy bananas at one store for $5 per bunch 
and in another store for $10 per bunch. What is the average 
expenditure? 

The rate is dollars per bunch, with the numbers of dollars vary¬ 
ing and the bunch constant. If we are to assume that we buy 
the same number of bunches at each store, we should use the 
arithmetic mean of $5 and $10 (since we are keeping constant 
the same factor, bunches, which is constant in the rate). If we 
are to assume that we spend the same amount of money at each 
store (say $50), we should use the harmonic mean of $5 and $10 
(since we are keeping constant the factor, dollars, which was 
variable in the rates). 

2. Mr. Jorgensen gets 12 miles to the gallon of gasoline with 
his car, and Mr. Gentry gets 18 miles to the gallon. What is 
the average gasoline consumption? 

In this case the gallon is constant and the miles vary. If we 
assume that both men drive the same distance we should take the 
harmonic mean of 12 and 18, finding an average of 14.4 miles 
to the gallon. If we assume that both men use the same number 
of gallons, we should take the arithmetic mean of 12 and 18, 
getting an average of 15 miles per gallon. We can prove these 
statements by taking numerical examples. If the two men drive 
36 miles each (keeping the distance the same for both of them) 
Jorgensen will use 3 gal. and Gentry will use 2 gal. This makes 
5 gal. for 72 miles, or 14.4 miles per gallon. On the other hand, 
if the men drive until they have used 2 gal. apiece, Jorgensen 
will have driven 24 miles and Gentry will have driven 36 miles. 
This will make 60 miles for 4 gal., or 15 miles per gallon. We 
thus see that in one case the arithmetic mean gives us the right 
answer, while in the other case we have to use the harmonic mean. 
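
The numerical check just given can be repeated mechanically. The following is a minimal sketch in Python (a language chosen here only for illustration); the variable names are our own, and `harmonic_mean` and `mean` come from the standard `statistics` module.

```python
from statistics import harmonic_mean, mean

rates_mpg = [12, 18]  # Jorgensen and Gentry, miles per gallon

# Case 1: both men drive the same distance (36 miles each).
distance = 36
gallons_used = [distance / r for r in rates_mpg]        # 3 gal. and 2 gal.
same_distance_avg = distance * len(rates_mpg) / sum(gallons_used)

# Case 2: both men use the same amount of gasoline (2 gal. each).
gallons = 2
miles_driven = [r * gallons for r in rates_mpg]         # 24 and 36 miles
same_gallons_avg = sum(miles_driven) / (gallons * len(rates_mpg))

# Total miles divided by total gallons agrees with the harmonic mean
# in the first case and with the arithmetic mean in the second.
print(round(same_distance_avg, 3), round(harmonic_mean(rates_mpg), 3))
print(same_gallons_avg, mean(rates_mpg))
```

In each case the average is simply total miles divided by total gallons; the choice of mean only records which quantity was held equal for the two drivers.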

We can summarize again, by saying, then, that the harmonic 
mean is used in certain cases where we are finding the average of 
rates. Every rate is stated as a variable number of units of 
something per constant number of units of something else—as a 
variable number of miles per single unit of time. If, in taking 
our average, we keep constant the factor that was variable in the 
rate, we must use the harmonic mean. 

The harmonic mean also has the advantages that it is rigidly 
defined, and its value depends on the value of each item in the 
distribution. The results can be used for further mathematical 
computation. 

On the other hand, the concept is an unfamiliar one, difficult 
for the layman to understand, and somewhat more difficult to 
compute than the arithmetic mean. It is greatly influenced by 
extreme items, especially so by extremely small items. It cannot 
be computed at all if any item in the distribution is equal to 
zero. The statistician would do well to use some other average 
save in those cases, just described, where he cannot get the right 
answer without it. 

5.21. Advantages and Disadvantages of the Quadratic Mean. 

The use of the quadratic mean can best be illustrated in con¬ 
nection with an actual case in the next chapter. In certain 
probability problems it is theoretically important to deal with 
the squares of numbers, rather than with the numbers themselves. 
In such cases the quadratic mean is natural. In other cases, 
as in dealing with deviations from the arithmetic mean, we cannot 
take an arithmetic average of the deviations, because the sum 
of the deviations is always zero—the positive and negative devia¬ 
tions balancing each other (see footnote on page 89). If we 
square the deviations, however, they are all positive, and we can 
take the arithmetic mean of the squares. In such a case, also, 
the quadratic mean is natural. But this average, although 
rigidly defined and amenable to further algebraic manipulation, 
is very greatly influenced by extremely large values, is somewhat 
more difficult to compute than the arithmetic mean, and is not 
simple enough to be readily understood by the layman. Therefore 
the statistician uses it only where there is a real reason for it. 
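
A short sketch may make these properties concrete. The items below are hypothetical figures chosen only for illustration, and the function name `quadratic_mean` is our own; the function takes the square root of the arithmetic mean of the squares.

```python
from math import sqrt

def quadratic_mean(values):
    """Square root of the arithmetic mean of the squares of the values."""
    return sqrt(sum(v * v for v in values) / len(values))

items = [3, 5, 7, 9]                       # hypothetical illustrative data
mean = sum(items) / len(items)
deviations = [v - mean for v in items]     # -3, -1, +1, +3

print(sum(deviations))                     # signed deviations cancel out
print(quadratic_mean(deviations))          # squaring removes the signs

# One extremely large item dominates the quadratic mean.
with_extreme = [1, 2, 3, 100]
print(sum(with_extreme) / len(with_extreme), quadratic_mean(with_extreme))
```

The last line shows the influence of extremely large values: the quadratic mean of the four items is nearly twice their arithmetic mean.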

5.22. Summary of the Averages. —The advanced student will 
learn for himself to use that discrimination in the choice of aver¬ 
ages which is a mark of statistical competence. The beginner 
may well follow the rule of using the arithmetic mean in prefer¬ 
ence to all other averages except in the following cases: 

1. When the distribution is badly skewed, consider the advisability of 
using the median or the mode. The mode is the harder to find and less 
reliable than the median, but is perhaps the most natural of all concepts of 
averages (see Secs. 5.17 and 5.18). 

2. When the distribution is U-shaped, use the two modes. 

3. When the items form a geometric progression, use the geometric mean 
(see Sec. 5.19). 

4. When finding an average rate of growth or change over a period of 
time, use the geometric mean (see Sec. 5.19). 

5. When logarithmic frequency classes give a more symmetrical frequency 
polygon than equal class intervals, use the geometric mean (see Sec. 3.20). 

6. When averaging rates, and it is desired to keep constant in the average 
the factor that is variable in the rate, use the harmonic mean (see Sec. 5.20). 

7. For certain index-number problems, use the geometric mean (see 
Chap. XIV). 

8. Whenever there is any reason to believe that the arithmetic mean 
would be seriously misleading, on account of undue influence from extreme 
items or for other reasons, consider the advisability of using the median 
or the mode. 

5.23. Suggestions for Further Reading. —The student will find a complete 
discussion of the problems here treated, in a great deal greater detail, in 
Franz Zizek, “Statistical Averages,” Henry Holt and Company, Inc., New 
York, 1913. A short, but rather technical and mathematical, treatment 
can be found in “Handbook of Mathematical Statistics,” edited by H. L. 
Rietz, Houghton Mifflin Company, Boston, 1924. A number of interesting 
and useful mathematical theorems with regard to the various averages is 
treated by John F. Kenney in his “Mathematics of Statistics,” Chap. III, 
D. Van Nostrand Company, Inc., New York, 1939. 

EXERCISES 

1. If we have a frequency distribution which is almost, but not quite, 
symmetrical, and the values of the mean and the mode are 27 and 29 lb., 
respectively, what will be the approximate value of the median? 

2. When we are computing the median, why do we make a different 
assumption as to the location of items within frequency classes from that 
which we make when computing the mean? 

3. Table 5.9 shows the number of laborers in the bread departments of 
American bakeries who, in 1931, were receiving hourly wages of various 
amounts. 1 From this frequency table compute the mean and the median 
hourly wage. Compute the modal wage by each of the two methods 
explained in this chapter, and compare the results. Compute the quartiles, 
the seventh decile, and the seventh percentile. 

4. Compute the mean of the figures in the preceding exercise by the long 
and by the short method, timing the processes. Compute by the short 
method first, so that any advantage which may come from familiarity with 
the data will accrue to the long method. 

5. Compute the mean of the figures in Exercise 3 above by the short 
method, taking a different guessed mean from that used before. Note the 
identity of the results. Note also why it is best to take a guessed mean 
near the center of the table. 

Table 5.9.— Hourly Wages in Bread Departments of Bakeries in 
United States, 1931 

Hourly Wage      Number of 
  (cents)        Laborers 

  0- 9.9              1 
 10-19.9             58 
 20-29.9            206 
 30-39.9            442 
 40-49.9            478 
 50-59.9            294 
 60-69.9             79 
 70-79.9             20 
 80-89.9              2 


6. Below is a diagram representing an array of heights. The figures 
can be thought of as representing 12 men standing in line and arranged 
in order of height. Suppose that we were to consider the median as located 
at the item represented by N/2 instead of (N + 1)/2. This would be the 
12/2 item. Locate the 12/2, or sixth, item and find how many men are on 
each side of it. 

[Diagram: 12 men standing in line, arranged in order of height] 

Locate the (N + 1)/2 item and see how many men are on 
each side of that. Locate likewise the quartiles as they would be if based 

1 Data from U.S. Bureau of Labor Statistics Bulletin 580, Table 5, p. 11. 
on N/4 and 3N/4 instead of on (N + 1)/4 and 3(N + 1)/4. See how 
many men each method puts in each quarter. The object of this exercise 
is to point out why we add unity in the formulas for median, quartiles, 
deciles, etc. 

7. Try applying the Charlier check to the data of Exercise 4. 

8. Find the median of the data in Table 5.9. 

9. Find the mode of the data of Table 5.9, using as the basis your figures 
for the arithmetic mean and the median. 

10. Find the mode of the data of Table 5.9, using the method described 
in number 5 of Sec. 5.13. 

11. Make an ogive of the data in Table 5.9. 

12. Find the median and the quartiles from the ogive made in the pre¬ 
ceding exercise. 

13. Find the geometric mean of the data in Table 5.9. 

14. Find the harmonic mean of the data in Table 5.9. 

15. Find the quadratic mean of the data in Table 5.9. 

16. List several other cases, similar to that mentioned in Sec. 5.17, where 
the median can be found for nonquantitative data. 

17. In a certain fraternity house there are 7 seniors whose average weight 
is 165 lb., 9 juniors with an average weight of 160 lb., 13 sophomores with 
an average weight of 152 lb., and 20 freshmen with an average weight of 
150 lb. What is the average weight of the 49 members of the fraternity? 
Use the arithmetic mean. 

18. If a defense bond costs $18.75 today, and if it matures in 10 years at 
$25, it has increased in value by 33⅓ per cent in 10 years. How much did 
it increase in value each year? In other words, what was the equivalent 
annual interest rate? 

19. Suppose that Mr. Carter pays 3 cents per kilowatt-hour for his 
electricity and Mr. Leonard pays 5 cents per kilowatt-hour. Mr. Carter 
uses 350 kw.-hr. and Mr. Leonard uses 300 kw.-hr. Find the average cost 
per kilowatt-hour. Note that the answer is neither the simple arithmetic 
nor the harmonic mean of 3 cents and 5 cents. 

20. Explain under what circumstances the average cost in the preceding 
exercise would have been the arithmetic mean of 3 cents and 5 cents. 

21. Explain under what circumstances the average cost in Exercise 19 
would have been the harmonic mean of 3 cents and 5 cents. 

22. What kind of average can one take of 3 cents and 5 cents in Exercise 
19 to get the correct answer? (The correct answer in Exercise 19 is 3.923 
cents per kilowatt-hour.) 



CHAPTER VI 


MEASURES OF DISPERSION 

6.1. Variability.—In the preceding chapters we attempted to 
find single values which could be used to represent whole groups 
of values. We tried, for example, to summarize the heights of 
1000 students by saying that the median height was 175 cm. Yet 
a moment’s consideration will make it plain that two frequency 
distributions may have averages which are exactly alike, even 
though the distributions are in other respects decidedly dissimilar. 
That is to say, the average does not tell the whole story about the 
characteristics of the distribution. 

Suppose that we have three distributions, each containing 
five values. They are as follows: 

Distribution I.   120, 120, 120, 120, 120 
Distribution II.  116, 118, 120, 122, 124 
Distribution III. 5, 17, 51, 140, 387 

The arithmetic mean of each distribution is 120; the medians of 
the first two distributions are also 120. Yet there are decided 
differences between the distributions. In the first distribution, 
either the mean or the median is a perfect figure for representing 
the values of the group; either average represents each individual 
item with complete accuracy. In the case of the second distribu¬ 
tion, either the mean or the median coincides with but one of 
the values. If we use it to represent any of the other values in 
the group, we shall have more or less error. However, the error is 
not great in any case, and the errors of overstatement are exactly 
balanced by the errors of understatement. In the third distribu¬ 
tion, neither the mean (120) nor the median (51) represents the 
items particularly well. The items are widely scattered, and 
many of them lie far from the mean or from the median or from 
any other single value which we might choose to represent them. 
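
The averages quoted for the three distributions are easily verified. The sketch below uses Python's standard `statistics` module; the dictionary layout and names are ours.

```python
from statistics import mean, median

distributions = {
    "I": [120, 120, 120, 120, 120],
    "II": [116, 118, 120, 122, 124],
    "III": [5, 17, 51, 140, 387],
}

# For each distribution, record the arithmetic mean and the median.
summary = {name: (mean(v), median(v)) for name, v in distributions.items()}

for name, (m, med) in summary.items():
    print(f"Distribution {name}: mean = {m}, median = {med}")
```

All three means equal 120, while the median of distribution III is 51, so neither average alone reveals how differently the items are scattered.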

Here, then, are three distributions with the same arithmetic 
mean, yet the distributions are markedly dissimilar. It would be 
quite easy to illustrate with cases where the median or the mode 
was the same in a number of radically different distributions. 

One of the most noticeable differences between the three dis¬ 
tributions we have just used as illustrations is the great difference 
between the degrees of concentration of the values. In distribu¬ 
tion I the five values are identical; there is no divergence at all. 
In distribution II there is a small scatter of values, but on the 
whole they are bunched fairly close to each other. In distri¬ 
bution III there is a great dispersion of values with no tendency 
for items to fall close to any point of concentration. In this chap¬ 
ter we study measures which show the amount of dispersion 
among data. These measures are variously called measures of 
dispersion, measures of scatteration, measures of variability, and 
measures of variation. Looked at from the opposite point of view 
they could, of course, be considered measures of concentration or 
measures of congregation. The name is not particularly impor¬ 
tant, but the concept is. In this book the term “dispersion” 
is commonly used, since it has the advantage of most general 
adoption. 

6.2. The Range.—On page 67 are listed the marks received in 
an examination taken by 90 students. The marks are arranged 
in an array. It is fairly easy to see, by a glance at the array, that 
there is a considerable dispersion of values. One of the common¬ 
est measures of dispersion in popular use is evident from the data 
as they appear. This measure of dispersion is called the range, 
and is equal to the difference between the largest and the smallest 
values in the group. In the case of the examination marks, the 
highest mark was 206 and the lowest was 43. We find the range 
by subtracting the smallest from the largest, thus: 

Range = 206 - 43 = 163 

When we say that the range of the marks is 163, we obviously 
say something about the degree of their concentration. If, again, 
we were to compute the ranges of the three distributions on page 
126, we should find these values: 

I. 120 - 120 = 0 

II. 124 - 116 = 8 

III. 387 - 5 = 382 

It is evident that, ceteris paribus, the larger the range, the greater 
is the scatter of the values in the group. 
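
The computation of the range is simple enough to state in a few lines. This is a minimal sketch; the function name `value_range` is our own.

```python
def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

distributions = {
    "I": [120, 120, 120, 120, 120],
    "II": [116, 118, 120, 122, 124],
    "III": [5, 17, 51, 140, 387],
}
ranges = {name: value_range(v) for name, v in distributions.items()}
print(ranges)

# For the examination marks only the two extreme marks are needed.
marks_range = value_range([43, 206])
print(marks_range)
```

Note that the function looks at only two items, however many items the group contains.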

When we attempt to determine the range of the items in a frequency 
table, we run into the difficulty of not knowing for certain 
the size of any item. We do not know the size of the largest or 
the size of the smallest item; hence we cannot determine accu¬ 
rately the difference between them. We can, however, tell 
approximately how large they are. If we go back to the figures 
on heights of Harvard students (page 83), we note that the small¬ 
est possible height (taken to the nearest unit) would be 155 cm. 
and the greatest possible height 199 cm. Thus a rough approxi¬ 
mation of the range would be 199 — 155 = 44 cm. We could, 
of course, take the two extreme class marks and call the difference 
between them the range. In this case it would be 198 — 156 = 
42 cm. Either approximation is good enough, since, as we shall 
now see, the value of the range is at best subject to considerable 
variation. 

The value of the range depends on but two items in the distri¬ 
bution: the largest and the smallest. Yet we have already noted 
(page 112) that it is at the extremes that chance variations are 
most noticeable and have the greatest effect. The largest item 
included in any group is largely a matter of chance. If we 
select groups of 1000 college students at random, there will be 
much less variation between the medians of the groups than 
between the extreme items. And the range is dependent entirely 
on the two extreme items—the two items that are above all sub¬ 
ject to chance fluctuation. On this account the range is itself 
very unstable. If it were not for this fact, the range would be an 
extremely useful measure of dispersion, because it is easily under¬ 
stood and easy to compute. But its instability is such a serious 
fault that it is seldom used as the measure of dispersion in work 
where care and accuracy count. Only if ease of popular com¬ 
prehension is more important than are exactitude and stability 
do we use the range. 
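
The instability of the extreme items can be illustrated by a small sampling experiment. The sketch below is an illustration only: the population (normally distributed heights with a mean of 175 cm. and a scatter parameter of 6.6 cm.) is an assumption of ours, not data from this book, and the seed is fixed merely so that the run is repeatable.

```python
import random
from statistics import median

random.seed(0)  # fixed seed so that the sketch is repeatable

def spread(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

medians, tallest = [], []
for _ in range(100):                      # 100 random groups of students
    # Hypothetical population of heights (an assumption for illustration).
    group = [random.gauss(175, 6.6) for _ in range(1000)]
    medians.append(median(group))
    tallest.append(max(group))

# How much do the medians differ from group to group, compared with
# how much the largest items differ from group to group?
spread_of_medians = spread(medians)
spread_of_tallest = spread(tallest)
print(round(spread_of_medians, 2), round(spread_of_tallest, 2))
```

Under these assumptions the largest item varies from group to group far more than the median does, and the range inherits that variability.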

6.3. The Semi-interquartile Range. —In order to escape from 
the chance fluctuations which occur toward the extremes of fre¬ 
quency distributions, statisticians are likely to discard the 
extreme items and find the amount of variation in the central 
part of the data. It is common for them to discard the upper 
quarter and the lower quarter of the items, and to measure the 
range in the remaining central half of the items. Thus we can 
find the value of the third quartile and subtract therefrom the 
value of the first quartile. This will give us the interquartile 
distance, or the interquartile range. For reasons that will appear 
later, statisticians more commonly use half of this distance as 
their measure of dispersion; that is, they subtract the value of the 
first quartile from that of the third quartile, and take half of the 
difference as their measure of scatter. Since we have previously 
computed the quartiles of several distributions, we can immedi¬ 
ately determine the value of the interquartile range, and, what is 
more useful, of one-half of it, that is, of the semi-interquartile 
range. 

In the case of students’ heights we have seen (page 103) that 
the first quartile of heights is 170.95 cm. The third quartile 
turns out to be 179.84 cm. The interquartile range is 

179.84 - 170.95 = 8.89 cm. 

The semi-interquartile range is one-half of this value, or 4.44. 
If we round off this value as before, we find that the semi-inter¬ 
quartile range is 4.4 cm. 

If we let the letter Q stand for the semi-interquartile range 
(since it has no subscript, it will not be confused with the symbols 
for the quartiles themselves), we can summarize our method of 
computing the semi-interquartile range by the formula 

$$Q = \frac{Q_3 - Q_1}{2}$$

The interquartile range is obviously the range within which 
half of the items fall—the central half of the items. In the above 
example we found that the interquartile range was 9 cm. (8.89 
cm). This means that within a range of 9 cm. were to be found 
half of the students measured. When we divide this figure by 2, 
it is on the assumption that the distribution is approximately 
symmetrical. Patently the quartiles of a symmetrical distribu¬ 
tion will be equally distant from the median (and from the mean 
and the mode, since these three averages will coincide in a sym¬ 
metrical distribution). If we divide the interquartile range in 
half, we have the distance from the median down to the lower 
quartile and the distance from the median up to the upper quar¬ 
tile. Thus, if the distribution is symmetrical, the semi-inter- 
quartile range tells the distance we must go above and below the 
median to include half of the cases. In the present example 
we may say, since Q = 4.5 cm., that by including all students 
whose heights are within 4.5 cm. of the median we shall include 
just half of the cases. The other half of the students will be 
more than 4.5 cm. removed from the average height. 

To illustrate again, we discovered on page 76 that the quartiles 
of the examination marks were as follows: $Q_1 = 88.75$; $Q_3 = 149.25$. 
In this case the semi-interquartile range is 

$$Q = \frac{149.25 - 88.75}{2} = \frac{60.5}{2} = 30.25$$

If the distribution were exactly symmetrical, the median would 
be just halfway between the quartiles, and removed by 30.25 
from each of them. Reference to page 76 will show that the 
median mark was actually 112, which is 23.25 from the lower 
quartile and 37.25 from the upper quartile. The average of 
these is (23.25 + 37.25)/2 = 60.5/2 = 30.25. This is the semi- 
interquartile range which we have already computed. 

Thus, in distributions where there is not complete symmetry, 
Q measures the average distance from the quartiles to the median. 
If we were given merely the median and the semi-interquar¬ 
tile range for these marks (that is, if we are told merely that 
Med. = 112 and Q = 30.25), we should be forced to interpret 
the latter measure in some such language as this: “If the distri¬ 
bution of marks is symmetrical, half of the marks are within 
30.25 of 112 and half of the marks are farther removed from 112 
than 30.25. At any rate, regardless of symmetry, if we discard 
the marks of the lowest quarter and also those of the upper quar¬ 
ter, the marks of the remaining half of the students will fall 
within a range of 60.5.” 
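
The two examples of this section can be checked with a short sketch. The function name is our own; the quartiles and the median are those quoted in the text.

```python
def semi_interquartile_range(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2

# Heights of students: Q1 = 170.95 cm., Q3 = 179.84 cm.
heights_q = semi_interquartile_range(170.95, 179.84)

# Examination marks: Q1 = 88.75, Q3 = 149.25, median = 112.
marks_q = semi_interquartile_range(88.75, 149.25)
median_mark = 112
below = median_mark - 88.75          # median to lower quartile
above = 149.25 - median_mark         # median to upper quartile
average_distance = (below + above) / 2

print(round(heights_q, 3), marks_q, below, above, average_distance)
```

For the marks, the two distances from the median are unequal (23.25 and 37.25), but their average is exactly Q, which is the interpretation given above for distributions that are not symmetrical.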

6.4. The Average Deviation. —There are, of course, some dis¬ 
advantages in discarding two quarters of the data in order that 
we may measure the dispersion of the remaining half. We should 
usually prefer some measure of dispersion based on all the items. 
It is obvious that we can get such a measure if we find how far 
each item is from the average, and then take an average of these 
deviations. Thus if we have the five values of distribution III 
on page 126, we can go about the process of measuring dispersion 
as follows: 

Distribution III. 5, 17, 51, 140, 387 

We have already noted that the mean of these items is 120. 
Now the first item differs from the mean by 115, the second item 
differs by 103, the third by 69, the fourth by 20, and the fifth 
by 267. If we average these we get 

$$\frac{115 + 103 + 69 + 20 + 267}{5} = \frac{574}{5} = 114.8$$

This is the average amount by which the items differ from the 
mean, and is called the average deviation. 

Now it will be seen that we neglected the fact that some of the 
deviations were positive and some negative. As a matter of 
fact we should have stated the deviation of the first item as 
— 115 and that of the last item as +267. Unfortunately, if 
we kept the signs and added algebraically, the positive values 
and the negative values would cancel each other, since it is a 
characteristic of the arithmetic mean of any group of values that 
the algebraic sum of the deviations from the mean is zero. 1 
Hence in computing the average deviation we neglect the signs 
of the deviations and add their absolute values. 
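
Both points, the cancelling of the signed deviations and the averaging of their absolute values, can be seen in a brief sketch. The function name `average_deviation` is ours.

```python
def average_deviation(values):
    """Mean of the absolute deviations of the values from their mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

distribution_iii = [5, 17, 51, 140, 387]
m = sum(distribution_iii) / len(distribution_iii)

signed = [v - m for v in distribution_iii]   # -115, -103, -69, +20, +267
print(sum(signed))                           # the signed deviations cancel
print(average_deviation(distribution_iii))   # signs neglected
```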

If we are to give in a formula directions for computing the 
average deviation of ungrouped data, we shall need some symbol 
to represent the amount by which an item differs from the aver¬ 
age. It is customary to represent this deviation by small rather 
than by capital letters. Thus the amount by which any X 
differs from the mean of the X’s is represented by x. We can 
define this term by the equation 

$$x = X - \bar{X}$$


To summarize the method of computing the average deviation 
(which is itself symbolized by AD), we have 


$$AD = \frac{\sum |x|}{N}$$


The vertical lines beside the x mean that the signs are to be 
neglected—that we are to add the values of the deviations as 
though they were all positive or zero. The formula would be 
read as follows. 


1 In fact it is this characteristic of the mean which makes possible the 
computation of the mean by the short method presented on p. 87. In that 
method we guess at a mean and calculate the sum of the deviations. If 
this sum turns out to be zero, we know that our guessed mean coincides with 
the actual mean. If the sum of the deviations turns out, as it usually does, 
to be other than zero, we adjust our guessed mean to the point where the 
sum of the deviations will equal zero. 

“The average deviation is equal to the sum of the absolute 
deviations of the items from their average, divided by the num¬ 
ber of cases.” 

The average deviation can be computed by another method 
which is somewhat better adapted to computation on calculating 
machines. This method 1 is summarized by the following 
formula: 

$$AD = \frac{2(B\bar{X} - b)}{N}$$

where B = number of measures below the mean. 

b = sum of the items below the mean. 

These two symbols are not used in this sense in other statistical 
formulas, and need not be remembered except in so far as they 
apply to this specific problem. 
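
That this formula agrees with the definition can be checked on the small distributions used earlier. In the sketch below the function names are ours, and the names `B` and `b` follow the symbols of the formula.

```python
def average_deviation(values):
    """Direct method: mean of the absolute deviations from the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

def average_deviation_machine(values):
    """Machine method: AD = 2(B * mean - b) / N."""
    n = len(values)
    m = sum(values) / n
    below = [v for v in values if v < m]
    B = len(below)        # number of measures below the mean
    b = sum(below)        # sum of the items below the mean
    return 2 * (B * m - b) / n

for items in ([116, 118, 120, 122, 124], [5, 17, 51, 140, 387]):
    print(average_deviation(items), average_deviation_machine(items))
```

The method works because the deviations below the mean and those above it are equal in total, so that doubling the total for the items below the mean gives the sum of all the absolute deviations.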

The problem of computing the average deviation from data 
grouped in frequency tables is simple. It can be done by a 
so-called “short method”; but in the case of the average devia¬ 
tion the time saved by this short method is not large and the 
method itself is so complicated that it would not pay to master it 
unless one were doing a good deal of work with the average devia¬ 
tion. We shall confine ourselves here to an exposition of the 
“long method,” the theory of which is easier to follow, and shall 
leave the interested student to acquaint himself with the other 
method if he desires. 2 

When we compute the average deviation by the long method, 
we determine the amount of deviation for the items of each class 
on the assumption that the items are concentrated at the class 
mark. We then find the average of these deviations, just as we 
find the average of any values that are grouped in a frequency 

1 Based on Truman Kelley, “Statistical Method,” pp. 70-75, The Macmillan 
Company, New York, 1924. The notation is changed in this 
presentation. 

2 For expositions of the “short” method see, for example: Secrist, “An 
Introduction to Statistical Methods,” pp. 342f., The Macmillan Company, 
New York, 1929; Davies and Crowder, “Methods of Statistical Analysis 
in the Social Sciences,” John Wiley & Sons, Inc., New York, 1933; Garrett, 
“Statistics in Psychology and Education,” pp. 32f., Longmans, Green & 
Company, New York, 1926; Mills, “Statistical Methods Applied to Economics 
and Business,” pp. 152f., Henry Holt & Company, New York, 1924; 
Chaddock, “Principles and Methods of Statistics,” pp. 156f., Houghton 
Mifflin Company, Boston, 1925. 
table. This means, of course, that we must start out by finding 
the value of the mean in order to find next the amounts of the 
deviations from the mean. 

Let us determine the average deviation of the heights of 
Harvard students (see Table 6.1). The class marks and the 
frequencies which we have used before appear in the first two 
columns. Then we determine how far each class mark is from 


Table 6.1.—Computation of Average Deviation of Heights of 
Harvard Students 

Class Mark   Frequency   Deviation from X̄        fx 
   (X)          (f)             (x) 

   156            4          -19.335          - 77.340 
   159            8          -16.335          -130.680 
   162           26          -13.335          -346.710 
   165           53          -10.335          -547.755 
   168           89          - 7.335          -652.815 
   171          146          - 4.335          -632.910 
   174          188          - 1.335          -250.980 
   177          181          + 1.665          +301.365 
   180          125          + 4.665          +583.125 
   183           92          + 7.665          +705.180 
   186           60          +10.665          +639.900 
   189           22          +13.665          +300.630 
   192            4          +16.665          + 66.660 
   195            1          +19.665          + 19.665 
   198            1          +22.665          + 22.665 

Total (neglect signs)........................  5278.380 


the true mean. We have discovered (see page 88) that the 
mean height of these students is 175.335 cm. If the 188 students 
in the class whose class mark is 174 cm. are to be considered 
as being concentrated at the class mark, each of them has a 
height of 174 cm. Each of them, that is, falls short of the mean 
by 1.335 cm., the value which appears opposite this class in the 
third column. Each figure in the third column shows the devia¬ 
tion of the corresponding class mark from the mean, which is 
175.335 cm. In other words, we subtract the true mean from 
the class mark to find the figures of this column. 

If 188 items differ from the mean by an amount of 1.335 cm. 
each, they deviate a total of 188 × 1.335 cm. = 250.98 cm. 
This is the figure that appears opposite the class in the fourth 
column. The figures in the fourth column show for their respec¬ 
tive classes the total amount of the deviation when all items in 
such classes are considered. These figures are the products 
obtained by multiplying together the figures opposite them in 
the second and third columns. 

If each figure in the last column shows the total amount of 
deviation for the class in question, then the sum of this column 
(taken without regard to signs) will show the total amount of 
deviation in the whole distribution. In this case the total devi¬ 
ation of the 1000 items is 5278.380 cm. The average deviation 
is found, of course, by dividing this total deviation by the number 
of cases. Hence, if $\sum f|x| = 5278.380$ and $N = 1000$, 

$$AD = \frac{\sum f|x|}{N} = \frac{5278.380}{1000} = 5.27838$$

The average deviation, then, is 5.3 cm. or 5 cm., depending on 
the amount to which we round it off. 

To summarize the steps involved in computing the average 
deviation from frequency tables: 


1. Compute the mean. 

2. Compute the deviation of each class mark from the mean by subtract¬ 
ing the mean from the class mark. 

3. Multiply the frequency of each class by the deviation of its class mark 
from the mean. 

4. Add the products just obtained, neglecting signs. 

5. Divide the sum just obtained by N. 

The summary in shorthand form is 

$$AD = \frac{\sum f|x|}{N}$$

What is meant when we say that the average deviation of 
student heights was 5 cm.? It means that the students measured 
varied in height. Some were above and some below the mean; 
some were near the mean in height and some were far from it. 
But these students differed from the mean an average of 5 cm. 
If, in another group of people, the average deviation of heights 
was 7 cm., we should say that they varied more on the average 
than did the Harvard group. In general, the larger the average 
deviation, the greater is the dispersion within the group. 
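
The five steps summarized above for frequency tables can be sketched as follows, using the class marks and frequencies of the Harvard heights. The variable names are ours, and each item is treated as concentrated at its class mark, as in the text.

```python
class_marks = list(range(156, 199, 3))   # 156, 159, ..., 198 cm.
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

n = sum(frequencies)

# Step 1: compute the mean from the grouped data.
mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n

# Step 2: deviation of each class mark from the mean.
deviations = [x - mean for x in class_marks]

# Step 3: multiply each frequency by the deviation of its class mark.
products = [f * d for f, d in zip(frequencies, deviations)]

# Step 4: add the products, neglecting signs.
total_deviation = sum(abs(p) for p in products)

# Step 5: divide by N.
average_deviation = total_deviation / n

print(n, round(mean, 3), round(total_deviation, 3), round(average_deviation, 5))
```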

It can be demonstrated 1 that the average deviation is smaller 
when computed from the median than when computed from any 
other point. Without a rigorous demonstration, we point out 
that the sum of the distances from any two points to any point 
between them is constant, and less than the sum of the distances 
to any point not between them. Since an equal number of 
cases lie above and below the median, the cases can be paired 
and the deviations will be smaller in sum than deviations from 
any point other than this. 2 Now the fact that the sum of the 
absolute deviations is smaller when taken around the median 
than when taken around any other value is a good reason for 
basing the average deviation on the median rather than on the 
mean. Sometimes this is done. In the vast majority of cases, 
however, the AD is based on the mean, and if any other base is 
used the fact should be stated. In our illustrative examples 
here we have based our computations on the mean. The varia¬ 
tions which would be involved in basing the measure on the 
median are obvious. 
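
The property of the median described above can be checked numerically on distribution III of Sec. 6.1. This is a sketch rather than a proof; the function name and the grid of trial points (whole numbers from 0 to 400) are our own choices.

```python
from statistics import mean, median

def mean_absolute_deviation(values, center):
    """Average of the absolute deviations of the values from `center`."""
    return sum(abs(v - center) for v in values) / len(values)

items = [5, 17, 51, 140, 387]
about_mean = mean_absolute_deviation(items, mean(items))
about_median = mean_absolute_deviation(items, median(items))
print(about_mean, about_median)

# The pairing argument: for the pair (5, 387) the sum of the distances
# to any point between them is constant.
pair_sums = [abs(5 - c) + abs(387 - c) for c in (5, 51, 120, 387)]
print(pair_sums)

# No trial point gives a smaller average deviation than the median.
smallest = min(mean_absolute_deviation(items, c) for c in range(0, 401))
print(smallest)
```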

6.5. The Standard Deviation.—The standard deviation, or root-mean-square 
deviation, is by far the commonest and most useful 
measure of dispersion in technical work. The range, we have 
seen, is unstable on account of its dependence on items whose 
size is largely a matter of chance. The semi-interquartile range 
arbitrarily excludes half of the items from consideration. The 
average deviation neglects the fact that some deviations are 
negative and some positive, and it treats them all as positive. 
Although the average deviation is an extremely useful measure 
of dispersion and is easily explained to the layman, nevertheless 
the neglect of the signs of the deviations makes this measure of 
dispersion almost useless in further mathematical work. We 
desire some measure of variation which escapes these several 
faults, and to a considerable extent the standard deviation does 
so. 

In scientific work the standard deviation is always represented 
by the small Greek letter sigma (σ), and it is so commonly used 

1 See, for example, Kelley, op. cit. f p. 74. 

* SeeLoviTTand Hotzclaw, “Statistics,” p. 109, Prentice-Hall, Inc., New 
York, 1929. 
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that the statistician forms the habit of reading the symbol <r as 
“standard deviation” rather than as “sigma.” In some 
instances those who are more familiar with Greek than with 
statistics go to the other extreme of using the word “sigma” 
when they mean “standard deviation.” But if someone says 
that the “sigma” of a distribution is 14, it is safe to interpret 
his statement to mean that the standard deviation of the distri¬ 
bution is 14. At any rate it would commonly be written 

σ = 14

The standard deviation, like the average deviation, is based 
on deviations from the mean. Of course, it was this base that 
got us into trouble in the case of the average deviation, since 
we had to neglect the signs of some of the deviations. In com¬ 
puting the standard deviation, however, we get around this 
difficulty by taking the quadratic mean of the deviations rather 
than their arithmetic mean. The student will recall that when 
we take a quadratic mean we square all the original figures. 
This makes them all positive, and we need not neglect signs. 
The standard deviation, then, is the square root of the arithmetic 
mean of the squares of the deviations. This description of it 
alone is enough so that the student should be able to go ahead 
with the computation by himself. We give, however, examples 
of the computation. 

Table 6.2.—Computation of Standard Deviation

      X          x          x²

      5        -4.9       24.01
      8        -1.9        3.61
     13        +3.1        9.61
     12        +2.1        4.41
     11.5      +1.6        2.56

     49.5                 44.20


6.6. Standard Deviation: Ungrouped Data. —When data are 
not grouped, we proceed exactly in accordance with the directions 
given above. Suppose we take the following items: 

5; 8; 13; 12; 11.5

The total of these five items is 49.5, and their average is

$$\bar{X} = \frac{49.5}{5} = 9.9$$

Let us compute their standard deviation (see Table 6.2). In 
the second column is given the difference between each value 
and the average, and the squares of these differences appear in 
the third column. We have thus found the sum of the squared 
deviations to be 

$$\sum(x^2) = 44.20$$

Since there are five deviations, the average of the squared devia¬ 
tions is found by dividing by 5, thus: 

$$\frac{\sum(x^2)}{N} = \frac{44.20}{5} = 8.84$$

And the square root of the average is

$$\sigma = \sqrt{\frac{\sum(x^2)}{N}} = \sqrt{8.84} = 2.97$$

The standard deviation, then, is 2.97. The process of finding
it can be summarized thus:

1. Find the mean.

2. Find for each item the deviation from the mean by subtracting the
mean from the item.

3. Square these deviations.

4. Add the squares just obtained.

5. Divide the sum just obtained by N.

6. Take the square root of the quotient just found.

Or, if we want the directions in shorthand form,

$$\sigma = \sqrt{\frac{\sum(x^2)}{N}}$$

This method of finding the standard deviation from ungrouped
data is correct, but another method is usually somewhat shorter
in application and gives exactly the same results. The directions
for this preferred method are

1. Square the original figures.

2. Add these squares.

3. Divide this sum by N.

4. Subtract from this quotient the square of the mean.

5. Take the square root of the difference.
The formula in this case becomes 1

$$\sigma = \sqrt{\frac{\sum(X^2)}{N} - \bar{X}^2}$$


Table 6.3.—Computation of Standard Deviation

      X          X²

      5         25
      8         64
     13        169
     12        144
     11.5      132.25

     49.5      534.25


$$\sum(X^2) = 534.25$$

$$\frac{\sum(X^2)}{N} = \frac{534.25}{5} = 106.85$$

$$\bar{X}^2 = 9.9^2 = 98.01$$

$$\frac{\sum(X^2)}{N} - \bar{X}^2 = 106.85 - 98.01 = 8.84$$

$$\sigma = \sqrt{8.84} = 2.97$$


1 The equivalence of these two formulas for σ may be seen from the
following:

$$\sigma = \sqrt{\frac{\sum(x^2)}{N}}$$

But, since x is the deviation of X from the mean, we have

$$x = X - \bar{X}$$

and, because ΣX = N X̄,

$$\sum(x^2) = \sum(X - \bar{X})^2 = \sum(X^2) - 2\bar{X}\sum X + N\bar{X}^2 = \sum(X^2) - N\bar{X}^2$$

Hence

$$\sigma = \sqrt{\frac{\sum(X^2)}{N} - \bar{X}^2}$$

If we illustrate with the same five values that we used before
the process becomes as shown in Table 6.3, above. This is
precisely the same answer that we found before. The work
of computation involved here seems as long as that of the earlier
method. For such a short example, and for one in which the 
mean happens to be a number with but two digits, it is perhaps 
as long. But let the student try the ordinary example, in which 
the mean turns out to be a number with 5 to 50 decimals, such 
as 12.396724. Try taking the deviation of each item from such 
a mean. Try squaring these deviations. Try adding the 
squares and taking the square root. In such a case any method 
which involves merely the squaring of the original figures without 
the taking of deviations is a blessing. It will be noted that 
capital X's rather than small x’s are used in the second formula 
to indicate that it is based on the original values rather than on 
the deviations of these values from the mean. 
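Both sets of directions for ungrouped data can be sketched in a few lines of Python. This is an illustrative sketch only; the variable names are our own, and the data are the five items used in Tables 6.2 and 6.3.

```python
from math import sqrt

values = [5, 8, 13, 12, 11.5]
n = len(values)
mean = sum(values) / n  # 9.9

# First method: the quadratic mean of the deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in values)
sigma_from_deviations = sqrt(sum_sq_dev / n)

# Preferred method: square the original figures, average the squares,
# subtract the square of the mean, and take the square root.
sum_sq = sum(x * x for x in values)
sigma_from_originals = sqrt(sum_sq / n - mean ** 2)

print(round(sum_sq_dev, 2))             # 44.2
print(sum_sq)                           # 534.25
print(round(sigma_from_deviations, 2))  # 2.97
print(round(sigma_from_originals, 2))   # 2.97
```

The two results agree except for floating-point rounding in the last places, which is the machine counterpart of the dropped decimals met in hand computation.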

6.7. Standard Deviation: Grouped Data. — When we have data 
in frequency tables, we can compute the standard deviation by 
either the long or the short method. In this case the short 
method lives up to its name and is a considerable timesaver. 
Below the long method is explained first, so that the student 
may see the reasons for the steps involved. We then illustrate 
and explain the short method as applied to the same data, so 
that the student may see where the savings in time are made. 
In order that we shall have all measures of dispersion on the 
same data for purposes of comparison, we illustrate again with 
the frequency table showing the heights of Harvard students. 
The pertinent parts of the table appear again in Table 6.4 with 
other information which is now needed. 

It is necessary first to compute the mean of the heights. We 
have discovered earlier that the mean height is 175.335 cm. 
(see page 88). Hence we state our class marks at the left of the 
table, and in the second column we state each class mark as a 
deviation from the mean. For example, the first class contains 
items whose values are 156 cm. (under our assumption that 
these items are concentrated at the class mark). But 156 cm. 
falls short of the mean by 19.335 cm. This value is listed 
in the second column. Opposite each other class mark is listed 
also its deviation from the mean, found by subtracting the mean 
from the class mark. These figures show for each class the 
amount by which each item in the class deviates from the mean. 

We have seen that the standard deviation is based on the 
squares of such deviations; hence the third column shows 
the squares of the values in the second column. In other
words, the square of the deviation for each item in the class 
is stated opposite each class. Then follows a column, with 
which we are familiar, showing the number of cases in each 
class. In the first class there are four items, and the squared 
deviation of each is 373.842225. Hence the total squared 


Table 6.4.—Computation of Standard Deviation from Grouped Data
(Long Method)

Class Mark    Deviation from Mean        x²           f           fx²
   (X)               (x)

   156             -19.335           373.842225        4      1,495.368900
   159             -16.335           266.832225        8      2,134.657800
   162             -13.335           177.822225       26      4,623.377850
   165             -10.335           106.812225       53      5,661.047925
   168              -7.335            53.802225       89      4,788.398025
   171              -4.335            18.792225      146      2,743.664850
   174              -1.335             1.782225      188        335.058300
   177               1.665             2.772225      181        501.772725
   180               4.665            21.762225      125      2,720.278125
   183               7.665            58.752225       92      5,405.204700
   186              10.665           113.742225       60      6,824.533500
   189              13.665           186.732225       22      4,108.108950
   192              16.665           277.722225        4      1,110.888900
   195              19.665           386.712225        1        386.712225
   198              22.665           513.702225        1        513.702225

 Totals                                             1000     43,352.774600


deviation of these four items is 4 × 373.842225 = 1,495.368900.
Similarly we find the total of the squared deviations for the
other classes by multiplying the squared deviation of column 3
by the number of cases listed in column 4. This gives us the last
column, which is headed fx². (In adding this column to get the
sum of the squared deviations for the distribution, it is not 
necessary to neglect signs, since all the values became positive 
when we squared the values of column 2 to get the values in 
column 3.) The sum of the last column, then, gives us the sum 
of the squared deviations for the distribution. We discovered 
earlier that we must now divide this sum by the number of cases 
and take the square root of the quotient. These operations give 
$$\frac{\sum(fx^2)}{N} = \frac{43{,}352.774600}{1000} = 43.3527746$$

$$\sigma = \sqrt{43.3527746} = 6.584$$


We have thus found the standard deviation of the heights of 
students to be 6.584 cm. Rounding it off, we have 6.6 cm. or 
7 cm. 1 
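The long method can be sketched in Python as follows. This is an illustration under the same assumption made in the text, namely that every item in a class is concentrated at its class mark; the variable names are our own, and the class marks and frequencies are those of Table 6.4.

```python
from math import sqrt

# Class marks (cm) and frequencies from the table of student heights.
class_marks = list(range(156, 199, 3))  # 156, 159, ..., 198
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n

# Square each class deviation from the mean, weight it by the class
# frequency, and add over all classes.
sum_f_x2 = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks))

sigma = sqrt(sum_f_x2 / n)

print(n)                # 1000
print(round(mean, 3))   # 175.335
print(round(sigma, 3))  # 6.584
```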


Table 6.5.—Computation of Standard Deviation from Grouped Data
(Short Method)

Class Mark    Frequency    Class Deviation
   (X)           (f)             (d)             fd         fd²

   156             4             -7              -28        196
   159             8             -6              -48        288
   162            26             -5             -130        650
   165            53             -4             -212        848
   168            89             -3             -267        801
   171           146             -2             -292        584
   174           188             -1             -188        188
   177           181              0                0          0
   180           125              1              125        125
   183            92              2              184        368
   186            60              3              180        540
   189            22              4               88        352
   192             4              5               20        100
   195             1              6                6         36
   198             1              7                7         49

 Totals         1000                            -555       5125


The process through which we have just derived the standard 
deviation is tedious and time-consuming. Fortunately an alter¬ 
native process is quick and easy. This short method of com¬ 
puting the standard deviation will now be explained, the same 
data being used for purposes of comparison. The short process 
is much like the short process of discovering the mean of grouped 
data, and, since we are using the same data which we then used, 
it may pay the student to review the section explaining that 
process in conjunction with the present exposition. We repeat
the necessary figures on height and add those data which are
necessary to compute the standard deviation in Table 6.5.

1 In the reference from which these figures are taken we are told that
σ = 6.56 cm. The reader will recall (see p. 86) that our figure for the
mean also differed from that of the original reference.

The student will remember that the computation of the mean 
by the short method was based on the practice of guessing at the 
mean, taking deviations from the guessed mean in units of the 
class interval, and making the necessary adjustment to compen¬ 
sate for the error in the guessed mean. In the short method 
for computing the standard deviation we follow a parallel proce¬ 
dure, and in our illustration we shift our guessed mean one class 
from its former position so that the student may see that the 
position of the guessed mean is of no importance (save as it 
minimizes work if it is near the large frequencies). 

If the mean had not already been computed from these data 
we could compute it now, although it is not necessary. 1 In 
the short method we proceed directly to the computation of the 
standard deviation itself. Having listed the class marks and 
the frequencies as before, we next guess at a mean, selecting 
always one of the class marks near the center of the distribution 
where the frequencies are large. In this case we have assumed 
that the mean is 177 cm. We have then, in the third column, 
stated the deviations in class intervals from the assumed mean. 
The first class is seven classes below the assumed mean, so we 
label it -7; the next class is -6; etc. Our figure for the first
class means that each of the four items in that class is seven
classes below the assumed mean; the figure for the second class
means that each of the eight items in that class is six classes
below the assumed mean; etc. We now multiply these class
deviations by the frequencies in their respective classes. Each
of the four items in the first class is -7 deviations from the mean,
so that this class totals -7 × 4 = -28 class deviations from
the mean. Similarly for the other figures in the fourth column.
It is important that the computer keep track of signs in this
work, for each class whose class mark is smaller than the assumed 
mean has a negative deviation. 

1 Substituting the values from the table into the formula for the mean 
(p. 90), we get 

$$\bar{X} = 177 + 3\left(\frac{-555}{1000}\right) = 177 - 1.665 = 175.335$$

This is the identical result that we found when we took the guessed mean at 
another point. 
Finally we get the last, or fifth, column in the table by multiplying
the figures in the fourth column by the figures in the
third (or the d) column. Since these figures are already the
product of f and d, and since we now multiply them by d again,
they are equal to fd². It is evident that this second multiplication
by d will make all the signs positive, so that we neglect no
signs.

We now add the three columns headed f, fd, and fd². These
totals are needed for the computation of the standard deviation.
The total of the f column we know already is Σf = N = 1000.
In adding the column of fd's to get Σ(fd), it is important to
keep track of the signs. In this case we find that Σ(fd) = -555.
We also find that Σ(fd²) = 5125. To find the standard deviation
from these figures we go through the following steps:

1. Divide Σ(fd²) by N.

2. Divide Σ(fd) by N, and square the quotient.

3. Subtract the second result from the first. 

4. Take the square root of this difference. 

5. Multiply the square root by the class interval. 

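In formula form this is

$$\sigma = (\text{class interval}) \sqrt{\frac{\sum(fd^2)}{N} - \left(\frac{\sum(fd)}{N}\right)^2}$$

Substituting the values of our present problem, we have

$$\sigma = 3\sqrt{\frac{5125}{1000} - \left(\frac{-555}{1000}\right)^2} = 3\sqrt{5.125 - 0.308025} = 3\sqrt{4.816975} = 3(2.194) = 6.582$$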

Thus we find that the standard deviation of the students' heights 
is 6.582 cm. Comparison with the answer obtained by the long 
method shows a discrepancy in the third decimal place: this 
results from the fact that we have dropped decimal places. In 
fact, we drop more decimal places in the long than in the short 
method, since in the former there are more decimal places to drop. 

The notation used in the formula for the short method is the 
same as that already used in computing the mean (see page 90). 
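The five steps of the short method can be sketched in Python. This is an illustration only; the function name short_method_sigma and its parameters are our own, and the guessed mean is supplied as the position of its class in the table (counting from zero). The square root is carried at full machine precision, so the last digits may differ slightly from a hand computation in which the root is cut off at three decimals.

```python
from math import sqrt

# Frequencies for the class marks 156, 159, ..., 198 cm.
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]
CLASS_INTERVAL = 3  # cm


def short_method_sigma(freqs, interval, guess_index):
    """Standard deviation from class deviations about a guessed mean.

    guess_index is the 0-based position of the class whose class mark
    is taken as the guessed mean.
    """
    n = sum(freqs)
    d = [k - guess_index for k in range(len(freqs))]  # class deviations
    sum_fd = sum(f * dk for f, dk in zip(freqs, d))
    sum_fd2 = sum(f * dk * dk for f, dk in zip(freqs, d))
    return interval * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)


sigma_177 = short_method_sigma(frequencies, CLASS_INTERVAL, 7)  # guessed mean 177 cm
sigma_174 = short_method_sigma(frequencies, CLASS_INTERVAL, 6)  # guessed mean 174 cm

print(round(sigma_177, 3))  # 6.584
print(round(sigma_174, 3))  # 6.584
```

Shifting the guessed mean by one class leaves the answer unchanged, which illustrates the remark that the position of the guessed mean is of no importance except as it affects the amount of work.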

6.8. Checking Accuracy of Computations.—We noted in Sec. 5.4 (see
page 91) that there are ways in which the statistician is able to check the
accuracy of his arithmetic in some computations. We applied such a
method in the case of the arithmetic mean. The Charlier check can also
be applied in the case of the standard deviation. Table 6.6 is exactly the
same as Table 6.5 except that a new column has been added at the extreme
right. This new column contains values of f(d + 1)². To find these
values, we add the number 1 to each value in the column headed (d). This
gives us values of (d + 1). We square these values and multiply the squares
by the frequencies in the column headed (f). For example, the top figure
in the (d) column is -7. If we add 1 we get -6. This value squared
gives us 36. We multiply 36 by 4 (the value of f) to get 144, the first figure
in the new last column. Similarly the fifth figure from the end of the
column (960) is found by adding 1 to the value of d to get 3 + 1 = 4, squaring
to get 16, and multiplying by 60, the frequency, to get 960.

Having obtained the numbers in the last column, we add them, getting
a total of 5015. We now apply the Charlier check, which consists of the
equation

$$\sum[f(d+1)^2] = \sum(fd^2) + 2\sum(fd) + \sum f$$

This means that the sum of the last column should be equal to the sum of the
(f) column plus the sum of the (fd²) column plus twice the sum of the (fd)
column. 1 In our example Σ[f(d + 1)²] is 5015; Σ(fd²) is 5125; Σ(fd) is
-555; and Σf is 1000. Substituting these values in the Charlier equation,
we get

5015 = 5125 + 2(-555) + 1000
5015 = 5125 - 1110 + 1000
5015 = 5015

This proves that our arithmetical work was correct. 
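The Charlier check lends itself to a short Python sketch. The variable names are our own; the frequencies and class deviations are those used in the short method, with the guessed mean at 177 cm.

```python
# Frequencies and class deviations for the class marks 156, ..., 198 cm.
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]
class_deviations = list(range(-7, 8))  # -7, -6, ..., 7

pairs = list(zip(frequencies, class_deviations))

sum_f = sum(f for f, d in pairs)
sum_fd = sum(f * d for f, d in pairs)
sum_fd2 = sum(f * d * d for f, d in pairs)
sum_f_d_plus_1_sq = sum(f * (d + 1) ** 2 for f, d in pairs)

left_side = sum_f_d_plus_1_sq
right_side = sum_fd2 + 2 * sum_fd + sum_f

print(sum_f, sum_fd, sum_fd2)  # 1000 -555 5125
print(left_side, right_side)   # 5015 5015
```

Agreement of the two sides checks the arithmetic of the columns. It cannot detect a frequency that was copied wrongly into the table, since such an error affects both sides alike.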

We shall discover in a later chapter that our standard deviation computed 
from a frequency table is inaccurate for another reason—namely, because 
of our assumption that the items within any given frequency class are all 
equal to the class mark of that class. This assumption involves relatively 
little error when one is computing the arithmetic mean, but it involves a 
biassed error, always in the same direction, when one computes the standard 
deviation, and always gives a value for the standard deviation which is too 
large. This error is discussed in Sec. 8.4, page 194. There it will be seen 
that the error is usually a very small one, and that the answer obtained 
by the methods discussed here is reasonably dependable.

Table 6.6.—Charlier Check for the Standard Deviation—Short Method

Class Mark    Frequency    Class Deviation
   (X)           (f)             (d)            (fd)       (fd²)     f(d + 1)²

   156             4             -7              -28        196         144
   159             8             -6              -48        288         200
   162            26             -5             -130        650         416
   165            53             -4             -212        848         477
   168            89             -3             -267        801         356
   171           146             -2             -292        584         146
   174           188             -1             -188        188           0
   177           181              0                0          0         181
   180           125              1              125        125         500
   183            92              2              184        368         828
   186            60              3              180        540         960
   189            22              4               88        352         550
   192             4              5               20        100         144
   195             1              6                6         36          49
   198             1              7                7         49          64

 Totals         1000                            -555       5125        5015

1 The proof is simple.

$$(d + 1)^2 = d^2 + 2d + 1$$

$$f(d + 1)^2 = fd^2 + 2fd + f$$

$$\sum[f(d + 1)^2] = \sum(fd^2) + 2\sum(fd) + \sum f$$

6.9. Meaning of the Standard Deviation.—We have found that
the standard deviation of the heights of 1000 Harvard students
is 6.6 cm. (rounded off from 6.582 cm.). As in the other measures
of dispersion, the larger the value of the standard deviation, the
less closely grouped are the items. A large standard deviation
means that the items are widely scattered. Under ordinary
circumstances the range, the semi-interquartile range, the average
deviation, and the standard deviation differ in size. The
semi-interquartile range is usually the smallest, followed by the average
deviation, the standard deviation, and the range, in the order
named. In those cases where we have what is called a “normal”
distribution 1 the sizes of the measures of dispersion bear a definite 
and known relationship, and in these distributions the semi- 
interquartile range is about two-thirds of the standard deviation 
and the average deviation is about four-fifths of the standard 
deviation. (More exact values are Q = 0.6745σ; AD = 0.7979σ.)
If we compare the measures which we have now computed for
students' heights, we find the following:

Q = 4.44 cm. (page 129)

AD = 5.28 cm. (page 134)

σ = 6.58 cm. (page 143)

1 See Chap. VII for a description of the normal distribution and for a more 
complete description of the interrelationships of the measures of dispersion. 
It will be noted that the values appear in the order we have just
indicated. Moreover, we see that in this case

$$Q = \left(\frac{4.44}{6.58}\right)\sigma = 0.6748\sigma$$

$$AD = \left(\frac{5.28}{6.58}\right)\sigma = 0.8024\sigma$$


Thus while these measures do not have exactly the relative size 
that they would have in a normal distribution, they have approxi¬ 
mately that relative size. 
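The constants quoted for the normal distribution can be reproduced with a brief Python sketch. It relies on two standard results that are not derived in this chapter: Q is the distance from the mean to the upper quartile, and the average deviation of a normal distribution equals σ times the square root of 2/π. The variable names are our own.

```python
from math import pi, sqrt
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Q / sigma: the upper quartile of the standard normal curve.
q_over_sigma = standard_normal.inv_cdf(0.75)

# AD / sigma: the mean absolute deviation of a normal curve,
# which is sqrt(2 / pi) standard deviations.
ad_over_sigma = sqrt(2 / pi)

print(round(q_over_sigma, 4))   # 0.6745
print(round(ad_over_sigma, 4))  # 0.7979

# The corresponding ratios observed for the students' heights.
observed_q_ratio = 4.44 / 6.58
observed_ad_ratio = 5.28 / 6.58
print(round(observed_q_ratio, 4))   # 0.6748
print(round(observed_ad_ratio, 4))  # 0.8024
```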

It is also true that in a normal distribution about two-thirds 
of all the items in the distribution 1 will fall within one standard 
deviation, and practically all the items within three standard 
deviations of the mean. (We have seen that 50 per cent of the 
items fall within Q of the mean, and in a normal distribution 
57.5 per cent of the items fall within AD of the mean. This 
gives us another basis for comparing these measures of dis¬ 
persion.) Thus if the heights of the Harvard men are normally 
distributed, we should expect that two-thirds of them have 
heights between 175.335 + 6.582 and 175.335 — 6.582 cm; that 
is, between the points which lie at a distance of one standard 
deviation on each side of the mean. In our problem this will 
mean between 181.917 cm. and 168.753 cm. The entire range of 
a distribution will ordinarily, then, lie within the three standard 
deviations above and the three standard deviations below the 
arithmetic mean—an over-all distance of six standard deviations. 
We discussed in Sec. 3.15 the problems involved in deciding how 
many classes to use in a frequency table, and how large the class 
interval should be. Fisher states 2 that, while grouping in 
frequency classes brings perforce some inaccuracy, nevertheless 
the error in estimating values from a normal population will be 
less than 1 per cent if the class interval does not exceed one
quarter of a standard deviation. If we think of the entire
distribution as being spread over six standard deviations, with a
class interval of one quarter of a standard deviation, we see that
this rule would require the use of approximately 24 classes to
include the bulk of the cases in many distributions. In practice,
however, the number of classes is seldom so large.

1 Actually 68.27 per cent of the cases will fall within 1σ and 99.7 per cent
within 3σ. See Chap. VII for a more complete discussion.

2 R. A. Fisher, “Statistical Methods for Research Workers,” 3d ed.,
p. 50, Oliver & Boyd, Edinburgh and London, 1932.

We can, then, interpret the standard deviation in this way.
When we are told that the standard deviation of heights is
6.6 cm., we know that the dispersion is less than it would be
in a group where σ = 10 cm. and more than in a group where
σ = 2 cm. We know that, if the distribution of heights is
about normal, approximately two-thirds of the items in the
group will be within one standard deviation of the mean, or,
in this case, within 6.6 cm. of the mean. We know that practically
all the cases will be within three standard deviations, or
19.8 cm., of the mean. Practically never will we find a height
less than 175.335 - 19.8, and practically never one more than
175.335 + 19.8. An inspection of the original data on heights
will show that these statements on extremes of height hold good
in this distribution.
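The proportions quoted in this section for a normal distribution can be verified with a short Python sketch. The helper name share_within is our own, and the figures hold only to the extent that a distribution is in fact normal.

```python
from math import pi, sqrt
from statistics import NormalDist

standard_normal = NormalDist()


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


pct_within_q = 100 * share_within(0.6745)         # within Q of the mean
pct_within_ad = 100 * share_within(sqrt(2 / pi))  # within AD of the mean
pct_within_1_sigma = 100 * share_within(1.0)
pct_within_3_sigma = 100 * share_within(3.0)

print(round(pct_within_q, 1))        # 50.0
print(round(pct_within_ad, 1))       # 57.5
print(round(pct_within_1_sigma, 2))  # 68.27
print(round(pct_within_3_sigma, 1))  # 99.7

# Limits lying one standard deviation on each side of the mean height.
mean_height, sigma = 175.335, 6.582
lower, upper = mean_height - sigma, mean_height + sigma
print(round(lower, 3), round(upper, 3))  # 168.753 181.917
```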

6.10. Variance.—While the standard deviation is the most 
useful single measure of dispersion, we shall find it desirable in 
some later applications to measure dispersion by the square of 
the standard deviation rather than by the standard deviation 
itself (see Chap. X). The square of the standard deviation 
of any distribution is called the variance of the distribution. 
When we found that the standard deviation of student heights 
was 6.58 cm., we could have squared the answer to find that the 
variance was 43.30 sq. cm. We note that the variance is in 
square measure—square centimeters rather than centimeters. 

Obviously we can find the variance of any distribution from
the formulas for the standard deviation given in Secs. 6.6 and
6.7 by omitting the radicals in those formulas, although in the
formula of Sec. 6.7 we shall also have to multiply by the square
of the class interval rather than by the class interval itself. The
directions for finding the variance of a frequency distribution are:

1. Divide Σ(fd²) by N.

2. Divide Σ(fd) by N, and square the quotient.

3. Subtract the second result from the first. 

4. Multiply the difference by the square of the class interval. 

The real advantage of the variance as a measure of dispersion 
will become apparent in our later work. In the meantime let us 
recall that if we have computed the arithmetic means of a number 
of subgroups, and wish to combine them into a single group, we
may compute the arithmetic mean of the combined group directly 
from the arithmetic means of the subgroups. The method was 
described in Sec. 5.16. Similarly it is possible, when we combine 
subgroups, to compute the variance of the combined group 
directly from the variances of the constituent subgroups. Let 
n₁ be the number of cases in the first subgroup and let n₂ be the
number of cases in the second subgroup, with N being the total
number of cases in the combined group, so that N = n₁ + n₂. Let
X̄₁ be the arithmetic mean of the items in the first subgroup,
X̄₂ in the second subgroup, and X̄ in the combined group. Let d₁
be the difference between the arithmetic mean of the first subgroup
and the arithmetic mean of the combined group:

$$d_1 = \bar{X}_1 - \bar{X}$$

Let d₂ be the corresponding difference between the mean of the
second subgroup and the mean of the combined group. Then
we have the relationship

$$v = \frac{n_1 v_1 + n_2 v_2 + n_1 d_1^2 + n_2 d_2^2}{N}$$

where v₁ and v₂ are the variances of the first and the second subgroups
and v is the variance of the combined group. It will be
seen that the variance of the large group is the weighted arith¬ 
metic mean of the variances of the subgroups plus the weighted 
arithmetic mean of the squared differences between the averages 
of the subgroups and the large group. This can be put in another 
form to show that the variance of the large group is the sum of 
two parts. 

1. The weighted arithmetic mean of the variances of the subgroups. 

2. The variance of the means of the subgroups themselves. 

This fact is extremely important in the analysis of variance , one 
of the most powerful of the recently perfected statistical tools. 
The subject is discussed in an elementary way in Chap. X. 

Before we leave the subject, let us illustrate the computation 
of the standard deviation on a major group from the data on the 
subgroups. Suppose we have in a given school 72 boys with an 
average height of 68 in. and a standard deviation of 3 in. In 
the same school are 38 girls with an average height of 61 in. and
a standard deviation of 2 in. What is the standard deviation in
the heights of all 110 people in the school? We find that the
average height for all the people in the school is 65.58 in. (using
the method explained in Sec. 5.16). The average for the boys
exceeds this by 2.42 in., while the average of the girls falls below
it by 4.58 in. We therefore substitute in the formula as follows
(remembering that if the standard deviation of the boys' heights
is 3 in. the variance is 9 sq. in., etc.):

$$v = \frac{72(9) + 38(4) + 72(2.42)^2 + 38(-4.58)^2}{110}$$

Carrying out the required computations, we find that v = 18.35
or that σ = √18.35 = 4.28 in. Thus we know that if we throw
the two groups together the standard deviation of the combined 
groups will be 4.28 in. We can carry this process out to any 
number of subgroups, merely adding in our numerator for any 
new group, x, the values nₓvₓ and nₓdₓ², and increasing our
denominator to include the sum of all the cases in all the sub¬ 
groups. We could give the directions for any number of sub¬ 
groups as follows: 1 

1. Multiply the variance in each subgroup by the number of cases in 
that subgroup. 

2. Add these products for all subgroups. 

3. Square the difference between the mean of each subgroup and the 
mean of the large group, then multiply this square for each subgroup by the 
number of cases in the group. 

4. Add these products for all subgroups. 

5. Add the sums found in steps 2 and 4 above.

6. Divide the sum in step 5 by the total number of cases in all groups. 
The quotient will be the variance of the large group. Its square root will 
be the standard deviation of the large group. 
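The six steps above can be sketched in Python for any number of subgroups. The function name combine_groups and the tuple layout (number of cases, mean, standard deviation) are our own choices for illustration; the figures are those of the boys and girls in the example.

```python
from math import sqrt


def combine_groups(groups):
    """Combine subgroups given as (n, mean, standard deviation) tuples.

    Returns the mean, variance, and standard deviation of the combined group.
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Steps 1 and 2: variance of each subgroup weighted by its number of cases.
    within = sum(n * sd ** 2 for n, _, sd in groups)
    # Steps 3 and 4: squared difference of each subgroup mean from the
    # combined mean, weighted by the number of cases.
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Steps 5 and 6: add the two sums and divide by the total number of cases.
    variance = (within + between) / total_n
    return grand_mean, variance, sqrt(variance)


boys = (72, 68.0, 3.0)
girls = (38, 61.0, 2.0)
grand_mean, variance, sigma = combine_groups([boys, girls])

print(round(grand_mean, 2))  # 65.58
print(round(variance, 2))    # 18.35
print(round(sigma, 2))       # 4.28
```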

1 For proof, see John F. Kenney, “Mathematics of Statistics,” pp. 95-97,
D. Van Nostrand Company, Inc., New York, 1939.

6.11. Measurement of Relative Dispersion.—The measures of
dispersion which we have treated are called “absolute” measures
of dispersion. The results are expressed in the same units as
the original data; that is, the standard deviation is 6.6 centimeters
or 7.4 dollars or 534 foot-pounds. The standard deviation is
expressed (as are the other measures of dispersion as well) in the
units in which the X values were originally stated. There is
nothing in the answer to show whether the standard deviation
is large or small. We might well have two distributions with 
the same standard deviation, say a standard deviation of 1 ft., 
and yet in the one case this might be a very large dispersion and 
in the other case a very small one. How is this possible? 

Suppose we illustrate. Imagine that we measure the lengths 
of the main-line track of the railroad systems of the United 
States. We find the length of each line and then compute the 
standard deviation in the lengths. A standard deviation of 
1 ft. would be unbelievably small. It would mean that a con¬ 
siderable number of the railroads were within 1 ft. of the average 
length, and that almost no railroad differed from the average 
length by more than 1 yd. Suppose, on the other hand, that 
we measure the lengths of the noses of 500 college seniors and 
find a standard deviation of 1 ft. Is this large or small? It is 
obviously large. It means that we might expect about one- 
third of our seniors to have noses which differed from the average 
length by as much as 1 ft.! In both of these cases the standard 
deviation is 1 ft., yet in one case it is unbelievably small and in 
the other impossibly large. This example illustrates the fact 
that the absolute size of the measure of dispersion does not tell 
us in itself whether the dispersion is large or small. 

But if these measures of dispersion cannot tell us what we 
want to know, how can we find out? Let us set up another 
problem. Suppose you were told that some keeper of a zoo 
had weighed, at one time or another, 150 newborn black bears. 
The weights were analyzed, and it was found that there was a 
standard deviation in the weights of ½ lb. Is this a large or a
small dispersion? Suppose we were told that the standard
deviation in the weight of newborn babies is also ½ lb. 1 Would
you think that the dispersion in the weights of the bear cubs
was greater or smaller than that in the weights of the babies?
In other words, the question becomes a relative one. Is a
½-lb. dispersion relatively large or relatively small? To what
shall we relate the dispersion? 

1 Neither of the standard deviations given in this paragraph is based on
actual figures. Both are hypothetical.

In practice we use as measures of relative dispersion a comparison
between the mean and the measure of dispersion. A
standard deviation of 1 ft. in lengths of railroad systems is
small when compared with the average length of railroad systems;
but a standard deviation of 1 ft. in the length of noses is large
when compared with the average length of noses. If we are to know
whether a standard deviation of ½ lb. in weights at birth is large
or small, we must know the average weight at birth. It is 
said that the average male human weighs about 7.5 lb. at birth 1 
and the average black bear cub comes into the world weighing 
about 10.5 oz. 2 Thus if the young of these two animals have 
the same dispersion in weights, the human babies are relatively 
much less variable than the bear cubs. 

The simplest and most obvious method of stating a measure of 
dispersion in relative terms which compare it with the mean is to 
state it as a percentage of the mean. This is the way in which 
all measures of relative dispersion are computed. We have 
discovered that the semi-interquartile range of student heights 
is 4.44 cm. (page 129). The mean height is 175.335 cm. (page 
88). If we wish to compute relative dispersion based on the 
semi-interquartile range for this distribution, we get it in this 
manner: 

$$\frac{100(Q)}{\bar{X}} = \frac{444}{175.335} = 2.53 \text{ per cent}$$


Measures of relative dispersion are always given in percentage 
terms and always show the percentage which the measure of 
absolute dispersion is of the average. The average used is 
almost always the mean; if any other average is used it should 
be specified. 

Any measure of absolute dispersion can be converted into a 
measure of relative dispersion by stating it as a percentage of 
the mean. The formula would be 


$$\frac{100(\text{absolute dispersion})}{\text{average}} = \text{relative dispersion}$$

A large relative dispersion does not mean that the values are 
widely scattered absolutely, but that they are widely scattered 
as compared with the mean. 
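
The general formula can be sketched in a few lines of Python. This is an illustrative sketch only: the function name relative_dispersion is our own invention, and the figures are the semi-interquartile range and mean of the student heights quoted above.

```python
def relative_dispersion(absolute_dispersion, average):
    """State a measure of absolute dispersion as a percentage of the average."""
    if average == 0:
        raise ValueError("the average must not be zero")
    return 100.0 * absolute_dispersion / average


# Semi-interquartile range (4.44 cm.) and mean (175.335 cm.) of the student heights.
heights_q = relative_dispersion(4.44, 175.335)
print(round(heights_q, 2))  # 2.53
```

Any other measure of absolute dispersion could be passed in the same way, provided it is stated in the same units as the average.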

1 L. E. Holt, “Care and Feeding of Children,” p. 33, Appleton-Century- 
Crofts, Inc., New York, 1928. 

2 E. T. Seton, “Lives of Game Animals,” Vol. II, Pt. 1, p. 174, Doubleday 
& Company, Inc., New York, 1929. 
Although any measure of dispersion can be used in conjunction 
with any average in the computing of relative dispersion, 
statisticians in fact almost always use the standard deviation as 
their measure of dispersion (see page 135) and the arithmetic 
mean as their average. When the relative dispersion is stated 
in terms of the arithmetic mean and the standard deviation, 
the resulting percentage is known as the coefficient of variation, 
or the coefficient of variability. This coefficient is symbolized 
by the letter V, defined thus: 

$$V = \frac{100\sigma}{\bar{X}}$$



If we take the hypothetical cases of bear and human weights 
which we used above, we can now compute the coefficients of 
variation: 

Human babies:                     Bear cubs: 

Mean weight = 7.5 lb.             Mean = 0.656 lb. 

σ of weights = 0.5 lb.            σ = 0.5 lb. 

V = 50/7.5 = 6.7 per cent         V = 76.2 per cent 

By comparing the two coefficients of variation, we discover that 
bear cubs are (in this hypothetical case) relatively much more 
variable in weights at birth than are human babies, even though 
their absolute variabilities are identical. 
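
The comparison can be checked with a short Python sketch. The function names are our own, the list of weights is invented purely for illustration, and the use of N rather than N - 1 in the standard deviation is an assumption. The last lines suggest why V serves for comparing series stated in different units: the same weights expressed in pounds and in ounces give the same coefficient.

```python
import statistics


def cv_from_summary(mean, sigma):
    """Coefficient of variation V = 100 * sigma / mean, in per cent."""
    return 100.0 * sigma / mean


def cv_from_data(values):
    """V computed from raw values; the standard deviation is taken with N
    (not N - 1) in the denominator, which is an assumption of this sketch."""
    return cv_from_summary(statistics.fmean(values), statistics.pstdev(values))


babies = cv_from_summary(7.5, 0.5)        # hypothetical figures from the text
bears = cv_from_summary(10.5 / 16, 0.5)   # 10.5 oz. expressed in pounds
print(round(babies, 1), round(bears, 1))  # 6.7 76.2

# V does not depend on the unit of measurement.
weights_lb = [6.8, 7.1, 7.5, 7.9, 8.2]    # invented for illustration only
weights_oz = [16 * w for w in weights_lb]
print(abs(cv_from_data(weights_lb) - cv_from_data(weights_oz)) < 1e-9)  # True
```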

For a final illustration let us compute from the problem we 
have been studying the coefficient of variation of student heights. 
Here we have 

$\bar{X} = 175.335$ cm. (page 88) 

$\sigma = 6.582$ cm. (page 143) 

$$V = \frac{658.2}{175.335} = 3.75 \text{ per cent}$$

Even now one does not know whether a standard deviation which 
is 3.75 per cent of the mean shows a large or a small scatter. 
One can judge this only by comparing it with other scatters. 
A relative dispersion which would be thought to be very large in 
one field might be considered small in another. Those sciences 
which have developed relatively great accuracy in measurement 
can ordinarily produce much smaller percentages of dispersion 
than can be found in those fields where measurement is still crude 
and approximate. 

Coefficients of relative (rather than absolute) variability are 
used when: 

1. The series to be compared are stated in different and noncomparable 
units. For example, if the standard deviation of heights is 6.6 cm. and the 
standard deviation of weights is 11.9 kg., which represents the greater 
variability? We cannot compare centimeters and kilograms. But we 
can say that the coefficient of variation in height is 3.75 per cent and in 
weight is 18.1 per cent. This comparison would show considerably more 
variability in weight than in height, at least in this group of students. 

2. The series, although stated in the same units, differ so in their average 
magnitudes that we should ordinarily expect much more absolute variation 
in the one than in the other. We have pointed out that one should expect 
more variation in the lengths of railroads than in the lengths of noses, even 
though both are measured in the same units (feet). 

6.12. Suggestions for Further Reading.—A good mathematical treatment 
of the problems involved in dispersion is found in John F. Kenney, “Mathematics 
of Statistics,” Chap. V, D. Van Nostrand Company, Inc., New York, 
1939. George R. Davies and Walter F. Crowder, in their “Methods of 
Statistical Analysis in the Social Sciences,” John Wiley & Sons, Inc., New 
York, 1933, discuss variations in the computation of the standard deviation 
which are skewed in logarithmic form. Truman L. Kelley, in his “Statistical 
Method,” The Macmillan Company, New York, 1924, described certain 
theoretical advantages in using the range between the 10th and the 90th 
percentiles as a measure of dispersion. 

EXERCISES 

1. Compute the standard deviation and the coefficient of variation of the 
wages given in Table 5.9, page 124. 

2. Measurements of 1017 freshmen women at Hollins College from 1920 
to 1927 show that the mean height was 63.86 in. and the standard deviation 
of heights was 2.09 in. The mean weight of these same students was 
115.65 lb., with a standard deviation of 15.78 lb. Compute the two coeffi¬ 
cients of variation. Were these students more variable in height or in 
weight? Were they more or less variable in height than the Harvard 
students? 1 

3. A group of 100 selected Smith students averaged 163.8 cm. in height, 
with a coefficient of variation of 3.3 per cent. 2 What was the standard 
deviation in their heights? 

4. A study of 129 mothers showed that the average age of the mother 

1 Data from Palmer, Physical Measurement of Hollins Freshmen, 
Journal of The American Statistical Association, Vol. 24, No. 165, March, 
1929, pp. 42-45. 

2 Palmer, op. cit., p. 42. 
at the time her first child was born was 23.9 years. The standard deviation 
in ages was 5.39 years. 1 What was the coefficient of variation? Was 
there more or less variation in mothers' ages at the birth of first-born than 
in heights of Harvard students? 

5. The average number of offspring in 55 completed families was 3.55. 
The standard deviation was 1.79. What was the coefficient of variation? 2 

6. A study of 22,498 divorces which took place in Wisconsin from 1887 
to 1906 shows that the average duration of the marriage which preceded 
the divorce was 10.37 years. The standard deviation was 8.39 years. The 
corresponding figures for the 2,651 divorces of 1929 were $\bar{X} = 9.83$ years 
and σ = 8.26 years. Had there been an increase or a decrease in the 
variability of marriage duration? 3 

7. A group of men were tested with respect to the strength of grip in their 
right hands. The average was 48.9 kg., and the standard deviation was 
1.94 kg. 4 Compute the coefficient of variability. 

8. Apply the Charlier check to your computation of the standard deviation 
in Exercise 1. 

9. The 85 girls who entered Hollins College in 1920 had an average height 
of 63.24 in. with a standard deviation of 2.35 in. The 125 girls who entered 
in 1921 had an average height of 63.74 in. with a standard deviation of 
1.77 in. 5 What was the standard deviation in the entire group of 210 girls 
for the two years combined? 

1 Conrad and Jones, Field Study of Differential Birth Rate, Journal of 
the American Statistical Association, Vol. 27, No. 178, June, 1932, p. 158. 

2 Ibid. 

3 Young and Deprick, Variation of Duration of Marriages Which End 
in Divorce, Journal of the American Statistical Association, Vol. 27, No. 178, 
June, 1932, p. 161. 

4 Benedict et al., Human Vitality and Efficiency under Prolonged 
Restricted Diet, Carnegie Institution Publication 280, p. 583. 

5 Palmer, loc. cit. 
7.1. Probability.—Suppose that you have a bag in which there 
are 25 white balls and 75 black balls. Suppose that the balls 
are well mixed, and you draw one ball at random from the bag. 
What is the probability that the ball selected will be white? 
There are evidently 25 chances that you will be successful and 
75 chances that you will fail, or 100 chances in all. If we let 
s represent the number of ways in which you can succeed and 
f the number of ways in which you can fail, and if these ways 
are equally likely, then we say that the probability of success is 

$$\frac{s}{s + f} = \frac{s}{n}$$

and the probability of failure is 

$$\frac{f}{s + f} = \frac{f}{n}$$

In our illustration the probability of success would be 

$$\frac{s}{n} = \frac{25}{100} = 0.25$$

and the probability of failure would be f/n = 75/100 = 0.75. 
In other words the probability of the occurrence of an event 
is the relative number of times which we would expect it to occur 
in an infinitely large number of trials. 

The probability of success is usually symbolized by the letter p, 
and the probability of failure by the letter q. It should be 
obvious that 

$$p + q = \frac{s}{s + f} + \frac{f}{s + f} = \frac{s + f}{s + f} = 1$$

In other words, the probability that an event will either happen 
or fail to happen is represented by the figure 1, which therefore 
stands for absolute certainty. Impossibility would be repre¬ 
sented by the figure 0. Chances between absolute certainty and 
impossibility would be represented by some decimal between 
0 and 1. It is also evident that if we know either p or q the value 
of the other can be calculated at once from the relationship 
p + q = 1. 
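
The bag of balls can be worked through in Python with exact fractions. This is a minimal sketch; the function name a_priori is ours and is not part of the text.

```python
from fractions import Fraction


def a_priori(successes, failures):
    """Return (p, q) when all s + f ways are equally likely."""
    n = successes + failures
    return Fraction(successes, n), Fraction(failures, n)


p, q = a_priori(25, 75)   # 25 white balls, 75 black balls
print(p, q, p + q)        # 1/4 3/4 1
print(1 - p == q)         # True: either value gives the other at once
```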

We have illustrated the probability concept with a case 
(the drawing of balls from a bag) in which one can reason out 
the probable results without experiment. To be sure, the 
reasoning depends on the past experience of the reasoner, and 
to this extent it would be incorrect to say that the result is 
based on reason rather than on experience. But it is true that 
one can come to some conclusions with regard to probabilities in 
such cases without carrying out experiments for the specific pur¬ 
pose of measuring the probability. In such cases, where we state 
the probability as a product of our reasoning, we call the result 
the a priori probability. 

In statistical work we have little contact with problems 
involving a priori probability except in those cases where we are 
deriving and illustrating theory. Most actual statistical prob¬ 
lems are so complicated that no one can reason out the expected 
results. For example, what is the probability that a child under 
one year of age who has whooping cough will recover? No 
amount of reasoning will tell us the answer. There are too many 
variables involved, and their relationships are too obscure. 
In such cases we fall back on the experience which we have had 
with the problem. The Minnesota State Department of Health 
stated that 50.5 per cent of children under one year of age recover 
from whooping cough and 49.5 per cent die. 1 Thus we can say 
that the probability of recovery is 50.5 cases out of 100, or 
50.5/100, or 0.505. The probability is usually stated in the 
latter form. The likelihood of failure to recover (death) would 
similarly be 0.495. These facts would be stated thus: 

p = 0.505 

q = 0.495 

1 Quoted in Faegre and Anderson, “Child Care and Training," p. 48, 
University of Minnesota Press, Minneapolis, 1930. 
Probability of this kind, which is based on records of past 
performance rather than on pure reasoning, is called statistical 
probability or empirical probability. One cannot rely on such 
probability except on the assumption that the past performances 
which form the basis of calculations were typical, and similar 
to what can be expected in the future. Thus one would have 
to be sure, before one used this figure for the probability of 
recovery from whooping cough, that the figures of past per¬ 
formance on which the estimate of probability is based were 
records of typical cases. If these figures were taken during an 
exceptionally severe or unusually light epidemic, or if the 
children were subjected to some particular type of medical care, 
or if in any way the cases differed from other cases to which we 
might wish to apply the probabilities, then these statistical 
probability figures might lead us astray. 

7.2. Mean and Standard Deviation of Probability Data.—If, 
on the other hand, we can assume that the basic data from 
which we compute statistical probability are typical, then 
probability figures will be very useful in the solving of statistical 
problems. Suppose that an epidemic of whooping cough breaks 
out in our community, and suppose that we can take the statisti¬ 
cal probability worked out from the Minnesota cases (p = 0.505) 
as being applicable to local conditions. There are, let us say, 
55 children in the community who are afflicted and who are 
under one year of age. How many will recover? We cannot 
tell with certainty, of course; sometimes more will recover and 
sometimes less. But on the average we should expect that 
0.505(55) will recover; that is, the average number of recoveries 
will be np = 0.505(55) = 27.775. In the average occurrence 
of 55 cases, therefore, we should expect 28 children to recover 
and 27 to die. 

We have, then, a very simple way of finding the average 
occurrence when the probability is known. If 10 cards are 
drawn at random from a well-shuffled pack of 52 cards, how many 
black cards will be among them? Sometimes we shall find more 
and sometimes less. Table 4.1, page 69, shows that when the 
experiment was actually tried 102 times, the number of black 
cards varied from 1 to 10. But what should one expect on the 
average in such cases? The total number of cards in the pack is 
52. Of these the 13 spades and the 13 clubs, making a total of 
26 cards, are black. Thus the probability (a priori) of drawing 
a black card is 26/52 = 0.5. We are to draw 10 cards. n, then, 
is 10. On the average we should expect to draw 

np = (10)(0.5) = 5 black cards 

A glance at the table on page 69 will show that in these trials 
the average was very close to 5 black cards out of 10. 
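
A reader without a pack of cards at hand can imitate the experiment in Python. The sketch below is only a simulation under stated assumptions: the function name, the number of trials, and the fixed seed are our own choices, and the 10 cards of each trial are dealt without replacement from a full pack.

```python
import random


def average_black_cards(trials, draws=10, seed=0):
    """Deal `draws` cards from a shuffled pack `trials` times and return
    the average number of black cards per deal."""
    rng = random.Random(seed)              # fixed seed so the run can be repeated
    pack = ["black"] * 26 + ["red"] * 26   # 13 spades + 13 clubs are black
    total_black = 0
    for _ in range(trials):
        hand = rng.sample(pack, draws)     # one deal of 10 cards
        total_black += hand.count("black")
    return total_black / trials


print(average_black_cards(102))      # a short run, like the 102 trials in the text
print(average_black_cards(20000))    # a long run settles close to np = 5
```

The short run may wander noticeably from 5, while the long run should lie very near it.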

But to be told that we should expect 27 children with whooping 
cough to die and 28 to recover, on the average, under the cir¬ 
cumstances previously mentioned, is not enough. We have 
just seen that one can expect to draw 5 black cards out of 10 
on the average, but the table also shows that on one of the drawings 
10 black cards were drawn. Is it not well within the realm of 
chance, then, that all the children will recover from whooping 
cough, or that they will all die? We see that, on the average, 
the recoveries and deaths will almost balance, but what are the 
chances of departure from this average? 

This is the same question that was raised in the preceding 
chapter on Dispersion. We saw there (page 126) that we do 
not by any means obtain a complete description of a frequency 
distribution from the mean. We need to know also something 
about the dispersion. In the case of deaths from whooping 
cough we want to know not only the average number that may 
be expected to live, but the dispersion of the numbers that will 
live. We have seen that, for a sample of size n, the average 
number of successes will be np. It can be demonstrated that 
the standard deviation of the number of successes will be $\sqrt{npq}$. 1 
Thus if we take our most recent example, in which 55 babies 
were afflicted with whooping cough, we have already seen that 
on the average 28 of them (27.775) would recover. It is now 
apparent that the standard deviation of recoveries will be 
$\sqrt{npq} = \sqrt{(55)(0.505)(0.495)} = \sqrt{13.75} = 3.7$. We can therefore 
say that in about two-thirds of such cases the number of 
recoveries will not differ from the average by more than 3.7, 
and that practically never will the number of recoveries differ 
from the average by over 3(3.7), or 11.1. This means, then, 
that, in two-thirds of the cases when 55 babies have whooping 
cough, between 27.775 + 3.7 and 27.775 - 3.7 will recover. 

1 For proof see Richardson, “Introduction to Statistical Analysis,” 
pp. 228-229, Harcourt, Brace and Company, New York, 1934. 



SIMPLE PROBABILITY AND THE NORMAL CURVE 159 


Carrying through the computations, we discover that the number 
of recoveries will run between 24.1 and 31.4 in two-thirds of the 
cases. The chances are two to one that the number of recoveries 
will be between 24 and 31. And we have also discovered that 
one almost never finds a value over 3σ from the mean. Here 
3σ = 3(3.7) = 11.1. We should almost never get more recoveries 
than 27.775 + 11.1, and almost never fewer than 27.775 - 11.1 
recoveries. Practically, then, the greatest number of recoveries 
that can reasonably be expected (if these cases are like those on 
which the statistical probabilities were computed) is 38.9, and 
the smallest number that can reasonably be expected is 16.7. 
We now know a great deal more about the likelihood of recoveries 
than was known when we knew merely that the average outcome 
would be 28 recoveries and 27 deaths. We shall come back to 
this problem again at a later point in this chapter. 
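
The whooping-cough figures can be reproduced with a short Python sketch. The function names are ours. As a further check, which is not part of the text, the second function adds up the exact chances, term by term, that the number of recoveries is a whole number lying within one standard deviation of the mean; the result should come out in the neighborhood of two-thirds.

```python
import math


def mean_and_sd_of_successes(n, p):
    """Return np and the square root of npq for n trials."""
    q = 1.0 - p
    return n * p, math.sqrt(n * p * q)


def probability_between(n, p, low, high):
    """Exact chance of at least `low` and at most `high` successes in n trials."""
    q = 1.0 - p
    return sum(math.comb(n, r) * p**r * q**(n - r) for r in range(low, high + 1))


mean, sd = mean_and_sd_of_successes(55, 0.505)
print(round(mean, 3), round(sd, 1))   # 27.775 3.7

# Whole numbers of recoveries lying inside mean - sd and mean + sd.
low, high = math.ceil(mean - sd), math.floor(mean + sd)
print(low, high)                      # 25 31
print(probability_between(55, 0.505, low, high))
```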

7.3. Elementary Theorems.—Up to this point we have been 
talking about the likelihood that one single thing will happen or 
fail to happen. What are the chances when two or more things 
are combined? Here we have two or three simple theorems 
which are demonstrated in every book on elementary algebra. 
They are merely listed and illustrated here; the student whose 
memory of them is hazy can refresh his mind from any good 
algebra. 

1. Events are said to be independent if the occurrence of one 
of them does not affect the occurrence of others. They are 
said to be dependent if the occurrence of the others is affected 
by the occurrence of the one. They are said to be mutually 
exclusive when, if one of them happens on a particular occasion, 
the other cannot happen. 

2. The probability that two or more independent events will 
all happen on a given occasion is the product of their separate 
probabilities. Thus, if we toss two pennies the chance that 
either will come up heads is 1/2. The probability that both will 
come up heads is 1/2 × 1/2 = 1/4. 

3. The probability that one or another of several mutually 
exclusive events will happen on a given occasion is the sum of 
their separate probabilities. Thus the probability of drawing 
an ace from a shuffled deck of cards on a single draw is 4/52 = 1/13. 
The chance of drawing a king is likewise 1/13, and this is also 
the probability of drawing a queen. What are the chances of 
drawing an ace or a king or a queen on a single draw? The 
probability is the sum of the separate probabilities: 

1/13 + 1/13 + 1/13 = 3/13 
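
The multiplication and addition theorems can be illustrated with exact fractions in Python. This is a sketch only, and the variable names are our own.

```python
from fractions import Fraction

# Independent events: multiply the separate probabilities.
heads = Fraction(1, 2)
both_heads = heads * heads
print(both_heads)            # 1/4

# Mutually exclusive events: add the separate probabilities.
ace = king = queen = Fraction(4, 52)
ace_king_or_queen = ace + king + queen
print(ace, ace_king_or_queen)   # 1/13 3/13
```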


If p is the chance of success on any trial, and we make n trials, 
the probability that the event will occur exactly r times (and fail 
n — r times) is 


$$\frac{n!}{r!\,(n - r)!}\; p^r q^{\,n-r}$$


If we draw a card from a shuffled pack, reinsert it, shuffle, draw a 
second card, and repeat the process until we have drawn 4 cards 
in this manner, what is the probability that we shall get exactly 
2 black cards in the 4 draws? Substituting in the formula, we 
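get 

$$\frac{4!}{2!\,2!}\left(\frac{1}{2}\right)^2\left(\frac{1}{2}\right)^2 = \frac{6}{16} = \frac{3}{8}$$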



Three times out of 8 (on the average) we should get exactly 2 
black cards in 4 draws. 
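
The formula for exactly r successes in n trials translates directly into Python. The function name binomial_probability is ours; exact fractions are used so that the answer appears as 3/8 rather than as a decimal.

```python
import math
from fractions import Fraction


def binomial_probability(n, r, p):
    """Chance of exactly r successes (and n - r failures) in n trials."""
    q = 1 - p
    return math.comb(n, r) * p**r * q**(n - r)


# Exactly 2 black cards in 4 draws with replacement, p = 1/2.
print(binomial_probability(4, 2, Fraction(1, 2)))   # 3/8

# The chances for r = 0, 1, 2, 3, 4 together exhaust all possibilities.
print(sum(binomial_probability(4, r, Fraction(1, 2)) for r in range(5)))   # 1
```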

7.4. Expansion of the Point Binomial.—Suppose we toss a 
single coin. There are two possible ways for it to fall (excluding 
the possibility that it will fall on its edge), and these we can 
symbolize by H for heads and T for tails. The possible results 
are, then 

1H + 1T 


If we throw two coins, they both can fall heads (this we can repre¬ 
sent by HH ); or the first can fall heads and the second tails ( HT ); 
or the first can fall tails and the second heads ( TH ); or both can 
fall tails ( TT ). Unless we had the coins numbered or otherwise 
distinguished, the second and third of these possible occurrences 
would appear to be identical; that is, we would have two ways in 
each of which we could get one tail and one head. We could 
summarize our possible results thus: 

HH    HT    TT 
      TH 


Or, to put them in another form, we could write 


1HH + 2HT + 1TT 





If we throw three coins, the possible results are (using similar 
symbols) 

HHH    HHT    HTT    TTT 
       HTH    THT 
       THH    TTH 


In the other form this would become (if $H^2T$ means 2 heads and 
1 tail) $1H^3 + 3H^2T + 3HT^2 + T^3$. With four coins the possibilities 
are 


HHHH    HHHT    HHTT    HTTT    TTTT 
        HHTH    HTTH    THTT 
        HTHH    TTHH    TTHT 
        THHH    THHT    TTTH 
                HTHT 
                THTH 

That is, the results are 

$$1H^4 + 4H^3T + 6H^2T^2 + 4HT^3 + 1T^4$$

Finally, if we try the experiment with five coins we discover these 
possibilities: 


HHHHH    HHHHT    HHHTT    HHTTT    HTTTT    TTTTT 
         HHHTH    HHTTH    HTHTT    THTTT 
         HHTHH    HTTHH    HTTHT    TTHTT 
         HTHHH    TTHHH    HTTTH    TTTHT 
         THHHH    HHTHT    THHTT    TTTTH 
                  HTHTH    THTHT 
                  THTHH    THTTH 
                  THHHT    TTHHT 
                  THHTH    TTHTH 
                  HTHHT    TTTHH 

This becomes $H^5 + 5H^4T + 10H^3T^2 + 10H^2T^3 + 5HT^4 + 1T^5$. 

The observing reader will note that the summary formulas 
which we are obtaining are the same results that would be 
obtained by raising the binomial (H + T) to higher and higher 
powers. Thus: 

$$\begin{aligned}
(H + T) &= H + T \\
(H + T)^2 &= H^2 + 2HT + T^2 \\
(H + T)^3 &= H^3 + 3H^2T + 3HT^2 + T^3 \\
(H + T)^4 &= H^4 + 4H^3T + 6H^2T^2 + 4HT^3 + T^4
\end{aligned}$$
etc. 

Thus by expanding the binomial we can get the same results at 
once that we would get from long experiment. 
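
The agreement between the listed outcomes and the expanded binomial can be checked by letting Python write out every possible fall of the coins. This is an illustrative sketch; the function name head_counts is ours.

```python
from collections import Counter
from itertools import product
from math import comb


def head_counts(coins):
    """List every possible fall of the coins and count how many falls
    show each number of heads, from all heads down to no heads."""
    counts = Counter(fall.count("H") for fall in product("HT", repeat=coins))
    return [counts[heads] for heads in range(coins, -1, -1)]


print(head_counts(3))   # [1, 3, 3, 1]
print(head_counts(5))   # [1, 5, 10, 10, 5, 1]

# The same figures are the coefficients of the expanded binomial.
print(head_counts(5) == [comb(5, k) for k in range(6)])   # True
```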
Elementary books on algebra give rules for the expansion of the 
binomial to higher powers. 1 By following these rules one obtains 
the proper coefficients and exponents for any power of the 
binomial. 

Perhaps the simplest of these rules is the following: 

To find the terms of the expansion of (q + p) n : 

a. The first term is $q^n$. 

b. The second term is $nq^{n-1}p$. 

c. In each succeeding term the power of q is reduced by 1 and the power 
of p is increased by 1. 

d. The coefficient of any term is found by multiplying the coefficient of 
the preceding term by the power of q in that preceding term, and dividing 
the product so obtained by one more than the power of p in that preceding 
term. 

Example: 

$$(q + p)^6 = q^6 + 6q^5p + 15q^4p^2 + 20q^3p^3 + 15q^2p^4 + 6qp^5 + p^6$$

We notice that, in accordance with rule a, the first term is $q^n$ or $q^6$. We 
notice that, in accordance with rule b, the second term is $nq^{n-1}p$ or $6q^5p$. 
The third term finds the power of q reduced by 1 and the power of p increased 
by 1 to give $q^4p^2$, and the coefficient is found in accordance with rule d; 
namely, we multiply the coefficient of the preceding term (6) by its power 
of q (5) and divide by one more than the power of p (1 + 1 = 2) to get 
6(5)/2 = 15. 
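
Rules a to d can be followed mechanically, as the Python sketch below suggests. The function name expansion_terms and the representation of each term as a triple are our own devices for illustration.

```python
def expansion_terms(n):
    """Return (coefficient, power of q, power of p) for each term of (q + p)^n,
    building every term from the one before it."""
    coefficient, q_power, p_power = 1, n, 0          # rule a: first term is q^n
    terms = [(coefficient, q_power, p_power)]
    while q_power > 0:
        # Rule d: multiply by the power of q in the preceding term and
        # divide by one more than the power of p in the preceding term.
        coefficient = coefficient * q_power // (p_power + 1)
        q_power -= 1                                  # rule c
        p_power += 1
        terms.append((coefficient, q_power, p_power))
    return terms


for term in expansion_terms(6):
    print(term)
# (1, 6, 0), (6, 5, 1), (15, 4, 2), (20, 3, 3), (15, 2, 4), (6, 1, 5), (1, 0, 6)
```

Rule b needs no separate step, since applying rule d to the first term already gives the coefficient n.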

We can also get these results quickly from Pascal's arithmetical 
triangle, part of which is given in Table 7.1. It will be noted 
that the figures in this table proceed in accordance with a definite 
rule. The first column consists of nothing but 1’s. The second 
column is the arithmetical progression 1, 2, 3, 4, . . . , and starts 
at the second row. Each number in the table is the sum of the 
number above it and the number to the left of that number. In 
other words, we add to a given number the number at its left 
and put the sum below the given number in the triangle. In the 
next-to-last row of the table, for example, appears the number 20. 
It is found by adding the number above it (10) and the number 
to the left of that number (10). Note that the rows in this 
triangle give the coefficients of the coin-tossing experiment. 
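
The rule of formation can be sketched in Python as follows. The function name pascal_rows is ours, and the sketch assumes the left-justified arrangement described above, in which the first row holds a single 1.

```python
def pascal_rows(last_power):
    """Build the rows of the triangle from power 0 through `last_power`.
    Each entry is the number above it plus the number to the left of that one."""
    rows = [[1]]
    for _ in range(last_power):
        above = rows[-1]
        new_row = [1]                                  # first column: nothing but 1's
        for i in range(1, len(above) + 1):
            directly_above = above[i] if i < len(above) else 0
            left_of_above = above[i - 1]
            new_row.append(directly_above + left_of_above)
        rows.append(new_row)
    return rows


for row in pascal_rows(6):
    print(row)
# The last row printed is [1, 6, 15, 20, 15, 6, 1]; its 20 is 10 + 10.
```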

1 See, for example, Fite, “College Algebra,” p. 150, D. C. Heath and 
Company, Boston, 1913; Rietz and Crathorne, “College Algebra,” p. 93, 
Henry Holt and Company, New York, 1919; Wilczynski and Slaught, 
“College Algebra,” p. 142, Allyn & Bacon, Boston, 1916; Griffin, “Introduction 
to Mathematical Analysis,” p. 431, Houghton Mifflin Company, 
Boston, 1921. 
There is always one more term in the expanded binomial than 
the number of coins tossed (or the number of equally likely 
independent events). With two coins there are three possible 
occurrences: two heads, one head, or no heads. Hence we look 
for the row in the table with one more term than the number 
of coins. We note that the expansion with three terms has the 
coefficients 1, 2, and 1. Thus we know that the relative number 
of occurrences of the possible outcomes of tossing two coins are: 
two heads once, one head twice, and no heads once. To be sure, 
these results would be experienced only in the long run. 


Table 7.1.—Coefficients of the Binomial Expansion 



It will be noted that as we add more and more terms to the 
binomial expansion [that is, as we raise (H + T) to higher and 
higher powers], we continue to have values which are small 
toward the extremes, get larger and larger as we approach the 
center, and exhibit absolute symmetry. If we raise the binomial 
to the 14th power, giving 15 terms, they are 1, 14, 91, 364, 1001, 
2002, 3003, 3432, 3003, 2002, 1001, 364, 91, 14, and 1. If we 
plot these on a frequency graph, we get the chart shown in 
Fig. 7.1. It will be noted that the chart exhibits absolute 
symmetry and regularity, and that there is a peak of high fre¬ 
quencies at the center from which the frequencies fall away 
toward the ends. The slope of the curve is at first gentle as we 
leave the peak, gets steeper and steeper for a time, and then 
slowly tends to level out. It never becomes quite level, but as 
it approaches the base line it becomes more and more nearly so. 
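
The fifteen coefficients of the 14th power, and the rough shape of the chart they produce, can be reproduced with the Python sketch below. The crude chart of "#" marks and its scale of one mark per 100 are our own devices, not those of the figure.

```python
from math import comb

coefficients = [comb(14, heads) for heads in range(15)]
print(coefficients)
# [1, 14, 91, 364, 1001, 2002, 3003, 3432, 3003, 2002, 1001, 364, 91, 14, 1]

print(sum(coefficients))                      # 16384, that is, 2 to the 14th power
print(coefficients == coefficients[::-1])     # True: absolute symmetry

# A crude frequency chart: one "#" for each 100 occurrences (rounded down).
for heads, frequency in enumerate(coefficients):
    print(f"{heads:2d} {'#' * (frequency // 100)}")
```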

7.5. The Normal Curve.—If the binomial were raised to an 
infinitely high power, the number of coefficients would become 
infinitely large, and the short straight lines of Fig. 7.1 would 
merge into a continuous, smooth curve. This curve, which is 
the limit approached as the binomial is raised to higher and 
higher powers, is called the normal curve of error, or more usually 
merely the normal curve. It is likewise variously called the 
Gaussian curve, the Laplacian curve, the probability curve, and 
the normal distribution curve. The general expansion of 
$(p + q)^n$ is called the point binomial, and in the special case 
where p = q = ½ and n is infinitely large we get the normal 
curve. In other words, the normal curve is a special case of 
the point binomial which we have when an infinitely large number 
of forces are operating, each of which is equally likely to happen 
or to fail. 

Fig. 7.1. Coefficients of $(a + b)^{14}$, giving the numbers of times that various 
numbers of heads would be expected to appear in 16,384 throws of 14 coins. 
(Horizontal axis: number of heads.) 
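
The approach of the point binomial to the normal curve can be watched numerically. The Python sketch below rests on assumptions of our own: it uses the unit-area form of the normal curve with mean n/2 and standard deviation equal to the square root of n/4 (that is, of npq with p = q = ½), and it measures closeness by the largest gap between the two at any whole number of heads.

```python
import math


def binomial_probability(n, k, p=0.5):
    """Chance of exactly k heads in n tosses."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_height(x, mean, sd):
    """Height of the unit-area normal curve at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))


def largest_gap(n):
    """Largest difference between binomial chance and normal height for n coins."""
    mean, sd = n / 2, math.sqrt(n) / 2
    return max(
        abs(binomial_probability(n, k) - normal_height(k, mean, sd))
        for k in range(n + 1)
    )


for n in (14, 100, 400):
    print(n, largest_gap(n))   # the gap shrinks as n grows
```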
It has been found in practice that the point binomial describes 
tolerably well many natural occurrences. It has been found 
especially that many phenomena of biology, economics, psychology, 
education, etc., even though not exactly normal in 
distribution, can be described roughly by the normal curve or 
some other point-binomial curve. To be sure, one seldom meets 
an actual distribution that is exactly symmetrical or is exactly 
normal in any other way—but likewise one seldom sees a trend 
that is perfectly described by a straight line or by a second-degree 
parabola. The normal curve is found in practice to be a con¬ 
venient method of smoothing out chance irregularities which 
occur in a frequency distribution, without departing in too great 
a degree from the underlying characteristics of the original data. 

We have already noted the fact that many frequency distribu¬ 
tions tend to have small numbers of cases near the extremes and 
many cases toward the center (see page 41). The heights of 
Harvard students, with which we have now become so familiar, 
were so distributed. It is common for such data as physical 
measurements to be so arranged. In fact, this type of distribu¬ 
tion is so common that some people have come to look on it as 
normal and call it the “normal distribution.” It should be 
emphasized that in most statistical problems there is no a priori 
reason for expecting normality of distribution—no reason for 
believing in advance that the data will be distributed as are the 
coefficients of the expansion $(\tfrac{1}{2} + \tfrac{1}{2})^n$. But so many groups of 
data are distributed in this manner that the characteristics of 
such a distribution become especially important. It becomes 
worth our while to study this “normal distribution” so that we 
shall know what it is like. Then, in those many cases where the 
binomial expansion does approximately describe the data, we 
shall know better how to handle the problem. In more advanced 
statistical work, other forms of the point binomial become important 
(cases where p ≠ q), but we shall confine ourselves in this 
chapter to a discussion of the most important case, that where 
p = q = ½, which we call the normal curve. 

First let us describe the normal curve. It is pictured in Fig. 
7.2. It should be noted that it is entirely symmetrical bilaterally. 
There is a high point exactly at the center, and the 
heights (frequencies) grow less and less toward the extremes. 
The slope grows steeper and steeper for a time as we progress 
toward the ends, and then the slope becomes less and less. We 
say technically that there is a “point of inflection” on each side 
of the curve, that is, a point where the slope ceases to become 
steeper and begins to become more gradual. Students of the 
calculus will recognize this peculiarity better if we say that 
there is a change in the sign of the second derivative of the curve 
at these points. If we let the height of the curve be represented 
by y, and distances along the horizontal axis measured from the 
mean of the X’s be represented by x (that is, x is a deviation from 
the mean), the mathematical equation of the curve is 

$$y = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-x^2/2\sigma^2}$$

In this equation σ represents the standard deviation of the X’s, 
π is the ratio of the circumference of a circle to its diameter, and 
e is the base of the Napierian system of logarithms and is equal 
to approximately 2.71828. This curve is asymptotic at the base; 
that is, it approaches closer and closer to the base line, but never 
quite reaches it. The horizontal distance from the center of the 
curve (which represents the mean, the median, and the mode) 
to either point of inflection is equal to one standard deviation. 
If we drop from the points of inflection lines perpendicular to the 
base, these two lines, the base line, and the curve will enclose 
68.27 per cent of the entire area under the curve. Perpendiculars 
erected at twice this distance from the mean (that is, a distance 
of 2σ) will, together with the base and the curve, enclose 95.5 per 
cent of the area under the curve. If the perpendiculars are 
moved to points which are 3σ each side of the mean, the area 
referred to will be 99.7 per cent of the total area under the curve. 
It is on the basis of these facts that the statements on page 146, 
relative to the interpretation of the standard deviation, were 
made. 

Fig. 7.2. The normal curve. The extremities are not shown, since the 
curve continues in either direction indefinitely. 
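
The areas just quoted can be recovered from the equation of the curve by adding up thin vertical strips. The Python sketch below is an approximation under assumptions of our own: it takes the curve in the form whose total area is 1, uses the trapezoidal rule with an arbitrary number of strips, and the function names are ours.

```python
import math


def normal_curve(x, sigma):
    """Height y of the normal curve at a deviation x from the mean."""
    return math.exp(-(x * x) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def area_within(k, sigma=1.0, strips=20000):
    """Approximate area between perpendiculars erected k standard deviations
    each side of the mean, by the trapezoidal rule."""
    low, high = -k * sigma, k * sigma
    width = (high - low) / strips
    total = 0.5 * (normal_curve(low, sigma) + normal_curve(high, sigma))
    for i in range(1, strips):
        total += normal_curve(low + i * width, sigma)
    return total * width


print([round(100 * area_within(k), 2) for k in (1, 2, 3)])   # [68.27, 95.45, 99.73]
```

The result does not depend on the size of σ; calling area_within(1, sigma=6.582) gives the same proportion.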

7.6. Areas under the Normal Curve.—It is possible to compute 
the percentage of the total area under the curve which will be cut 
off by perpendiculars erected at any number of standard devia¬ 
tions from the mean. The number of cases, the mean, and the 
standard deviation give a complete description of any curve 
which is really normal, and if we know these three values we can 
reconstruct the entire curve. This has made it possible to con¬ 
struct tables showing the percentage of the total area which falls 
within given numbers of standard deviations from the mean. 
Table 7.2 is a short example of this kind. A somewhat longer 
one appears in Appendix I (see page 514). 

To find the portion of the area under the curve which lies 
between the mean and any other point we proceed as follows: 
Suppose we desire to find the portion of the area between the
mean and a point which is removed from the mean by 1.7 standard
deviations. We look for the column headed “1 standard
deviation,” and we look in the row opposite the entry 7 in the
left-hand column (which lists tenths of standard deviations).
We find the entry 0.4554. This means that 45.54 per cent of the
total area of the curve lies between the mean and a point either
1.7σ above the mean or 1.7σ below the mean. Hence 2(45.54)
or 91.08 per cent of the area will be within 1.7σ of the mean.
Since the area of the curve represents the total number of cases
in the distribution, we can say that if the values are normally
distributed 91 per cent of them will lie within 1.7σ of the mean
(see Fig. 7.3).

Table 7.2.—Relative Areas under the Normal Curve between the
Mean and Various Numbers of Standard Deviations

Tenths of          Whole Standard Deviations
   a σ          0          1          2          3

    0        0.0000     0.3414     0.4773     0.4986
    1        0.0398     0.3643     0.4821     0.4990
    2        0.0793     0.3849     0.4861     0.4993
    3        0.1179     0.4032     0.4893     0.4995
    4        0.1554     0.4192     0.4918     0.4997
    5        0.1915     0.4332     0.4938     0.4998
    6        0.2258     0.4452     0.4953     0.4998
    7        0.2580     0.4554     0.4965     0.4999
    8        0.2881     0.4641     0.4974     0.4999
    9        0.3159     0.4713     0.4981     0.5000
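
The entries of Table 7.2 can be recomputed rather than looked up. The sketch below is an illustration only, assuming Python 3.8 or later (for `statistics.NormalDist`); the names `area_from_mean` and `table` are ours. Each entry is the area between the mean and a point lying the stated number of standard deviations away. Because the printed table was rounded by hand, a computed entry may differ from the printed one by a unit in the fourth decimal place.

```python
from statistics import NormalDist

unit = NormalDist()  # normal curve with mean 0 and standard deviation 1


def area_from_mean(z):
    """Area under the unit normal curve between the mean and a point z
    standard deviations away from it."""
    return unit.cdf(z) - 0.5


# Rows are tenths of a standard deviation, columns are whole standard deviations.
table = {
    tenths: [round(area_from_mean(whole + tenths / 10), 4) for whole in range(4)]
    for tenths in range(10)
}

for tenths, row in table.items():
    print(tenths, "  ".join(f"{value:.4f}" for value in row))
```

Reading row 7 under column 1 gives the area between the mean and 1.7σ used in the example above.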



Fig. 7.3. The normal curve with perpendiculars erected at points 1.7 
standard deviations each side of the arithmetic mean. The shaded area, 
enclosed by the base line, the perpendiculars, and the curve, is 91 per cent
of the total area under the curve. 

This table is of great help in the interpretation of statistical 
conclusions. We shall, therefore, use it to aid in interpreting 
two more examples. 

We discovered (page 88) that the mean height of Harvard 
students is 175.335 cm. The standard deviation of their heights
is 6.6 cm. (page 141). How likely is it that a student chosen at
random will exceed 185 cm. in height? We attack this problem
thus: The question is, what is the probability that a value will
exceed the mean by 9.665 cm.? That is, how likely is it that a
value will be as much as 9.665/6.6 = 1.46σ above the mean? If
the heights are normally distributed, 50 per cent of them will fall
short of the mean. And our table tells us that between the mean
and a point 1.5σ from the mean will be another 43.32 per cent of
the cases. (We could get somewhat more accurate figures from
the table in Appendix I, which shows that 42.8 per cent of the area
falls between the mean and a point 1.46σ from the mean. We
shall use the shorter table here, however, and round off our deviation
from 1.46σ to 1.5σ.) Thus if we include all the area from a
point 1.5σ above the mean on down, we include 50 + 43.32 = 93.32
per cent of the cases. We can say, then, that in only 7 per cent of
the cases will a student chosen at random exceed a height of
185 cm.

Let us go back to the whooping-cough problem which we met 
early in this chapter (page 156). We discovered that, when 55 
babies less than a year old are afflicted, on the average 27 deaths 
and 28 recoveries will result. We also discovered that the standard
deviation in the number of recoveries is 3.7. How likely is it
that as few as 22 babies will recover? 

Our procedure is just as before. We shall outline it here. 

1. What is the mean? (28 recoveries) 

2. What is the standard deviation? (3.7) 

3. What is the point, about which we want information? (22 recoveries) 

4. How far is it from the mean? (28 - 22 = 6)

5. How many standard deviations is it from the mean? (6/3.7 = 1.62)

6. What per cent of the cases lie between this point and the mean? 
(44.52 per cent) 1 

7. What per cent of the cases lie the other side of the mean? (Always 
50 per cent) 

8. This makes a total of what per cent of the cases? (50 + 44.5 = 94.5
per cent)

9. How likely is the occurrence mentioned? It will happen in 5.5 per
cent of the cases and fail in 94.5 per cent of the cases; that is, in 55 cases
out of 1000 we should find fewer than 22 recoveries. In 945 cases out of
1000 we should find more recoveries. This, of course, would happen only
on the average.

1 Taken from the table on p. 168 and as 1.6σ from the mean. Actually
there are 44.7 per cent of the cases between the mean and 1.62σ (see Appendix
I, p. 514).
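
The nine steps of the outline can be gathered into one small routine. What follows is a hedged sketch, assuming Python 3.8 or later; the function `tail_beyond` and its argument names are ours, introduced only for illustration. It carries the deviation to full precision instead of rounding to the nearest tenth of a standard deviation, so its answers differ a little from the figures obtained with the short table.

```python
from statistics import NormalDist


def tail_beyond(point, mean, sd):
    """Return (z, share), where z is the distance of `point` from the mean
    in standard deviations and share is the fraction of cases lying
    farther from the mean than `point`, on the same side."""
    z = abs(point - mean) / sd               # steps 4 and 5
    between = NormalDist().cdf(z) - 0.5      # step 6: area from mean to point
    covered = 0.5 + between                  # steps 7 and 8
    return z, 1.0 - covered                  # step 9


# Whooping-cough example: mean 28 recoveries, standard deviation 3.7.
z, share = tail_beyond(22, mean=28, sd=3.7)
print(f"recoveries: {z:.2f} standard deviations, {100 * share:.1f} per cent")

# Height example: mean 175.335 cm., standard deviation 6.6 cm.
z, share = tail_beyond(185, mean=175.335, sd=6.6)
print(f"heights: {z:.2f} standard deviations, {100 * share:.1f} per cent")
```

The same routine serves for a point above the mean or below it, since the curve is symmetrical.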

It is now seen that the standard deviation is an extremely 
valuable measure in conjunction with a frequency distribution 
if the distribution is normal or approximately so. 1 Under such 
circumstances we can tell what percentage of the total cases will 
fall within given numbers of standard deviations from the mean. 
It must be remembered that deviations are always measured in 
units of the standard deviation. 

This fact—that all normal curves can be described in such 
terms—makes it possible to compare some measures which could 
not otherwise be compared. We give but one example, but 
others will quickly suggest themselves. Suppose that John 
scores 127 on a test on which the average score is 112 and the 
standard deviation of scores is 15. Robert scores 98 on a test 
on which the average score is 90 and the standard deviation is 3. 
Who makes the better score? 

It is immediately obvious that we cannot say that John makes 
the better score, merely because his score was higher. He took 
the easier test, as is shown by the fact that the average score was 
higher. We note next that John is 15 points above the average 
for his test, while Robert is but 8 points above the average for the 
test that he took. But again we cannot say that this proves that 
John is better, because there was a great deal more variation on 
John’s test than on Robert’s: the standard deviation is much 
higher. We must find out how far each deviates from the mean
in standard units, that is, in units of the standard deviation. If
we do this we find that John is 15/15 = 1σ above the mean on his
test, and Robert is 8/3 = 2.67σ above average on his test. This
shows considerably better performance on Robert’s part than on 
John’s. If the distributions of marks are normal, John’s mark is 
exceeded by 16 per cent of those who took his test and Robert’s 
score is exceeded by but 0.35 per cent of those who took his test. 
The student should verify these figures for himself, using Table 7.2 
or Appendix I. 
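
The comparison of John and Robert may be sketched as follows. This is an illustration under the text's own assumption that both sets of marks are normally distributed; it assumes Python 3.8 or later, and the helper names `standard_score` and `share_exceeding` are ours. Since the deviation is not rounded to the nearest tabled value, the share exceeding Robert's mark comes out slightly different from the figure read from the short table.

```python
from statistics import NormalDist


def standard_score(score, mean, sd):
    """Deviation from the mean expressed in units of the standard deviation."""
    return (score - mean) / sd


def share_exceeding(z):
    """Fraction of a normal distribution lying above a standard score z."""
    return 1.0 - NormalDist().cdf(z)


john = standard_score(127, mean=112, sd=15)
robert = standard_score(98, mean=90, sd=3)

print(f"John:   {john:.2f} standard units, exceeded by {100 * share_exceeding(john):.1f} per cent")
print(f"Robert: {robert:.2f} standard units, exceeded by {100 * share_exceeding(robert):.2f} per cent")
```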

1 The figures here given are for perfectly normal distributions, but Salvosa 
has published tables similar to these showing areas under the ordinates of 
curves of varying degrees of asymmetry. See Luis R. Salvosa, Tables of
Pearson’s Type III Function, Annals of Mathematical Statistics, Vol. 1,
May, 1930, pp.

If we wish to find the distance which, when laid off above and
below the mean, will include half the area under the curve, we 
look in the table on page 168 or in the table in Appendix I, page 
514. We hunt for the point which, when laid off on one side of 
the mean, will include 25 per cent of the cases (because, since the 
curve is symmetrical, this distance both sides of the mean will 
include 2 × 25 per cent = 50 per cent of the cases). We discover
that we need to go between 0.67σ and 0.68σ to reach this
point. As a matter of fact, it is necessary to go 0.6745σ from the
mean in each direction in order to enclose half of the area. But
it is to be remembered that the semi-interquartile range, when
laid off on each side of the mean in a symmetrical distribution,
includes half the area. 1 We thus see that Q = 0.6745σ, as we
discovered on page 145. It can also be shown that, when the
distribution is normal, AD = 0.7979σ, as stated on page 145 also.
One should remember, then, that a distance equal to about
two-thirds of the standard deviation laid off on each side of the
mean will include half the cases in a normal distribution (see
Fig. 7.4).

Fig. 7.4. Perpendiculars erected under the normal curve at distances of
0.6745 standard deviation on each side of the mean. The area enclosed by
the base line, the perpendiculars, and the curve is one-half of the total area
under the curve.
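
Both constants can be recovered directly. The sketch below assumes Python 3.8 or later; the variable names are ours. The first constant is the distance from the mean below which three-fourths of the area lies, so that one-fourth lies between it and the mean. The second uses the known result that the average deviation of a normal distribution is the standard deviation multiplied by the square root of 2/π.

```python
import math
from statistics import NormalDist

# Distance (in standard deviations) that cuts off 25 per cent of the area
# on one side of the mean, i.e. the 75th percentile of the unit normal curve.
q_over_sigma = NormalDist().inv_cdf(0.75)

# Ratio of the average deviation to the standard deviation in a normal curve.
ad_over_sigma = math.sqrt(2.0 / math.pi)

print(f"Q  = {q_over_sigma:.4f} sigma")
print(f"AD = {ad_over_sigma:.4f} sigma")
```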

1 See pp. 128 and 129.

These various relationships hold true strictly only when the
distribution is exactly normal. It is seldom that empirical data 
show absolute normality, just as it would be unusual for a man 
throwing five pennies 32 times to get exactly one case where no 
heads turned up, 5 cases in which one head turned up, 10 cases of 
two heads, 10 cases of three heads, 5 cases of four heads, and one 
case when all the pennies turned up heads. With an infinite 
number of throws of five pennies, one should expect these proportions, 1
but in any finite number there might be some deviation
from it. Thus also, even though we might get an exactly normal 
distribution of heights if we had an infinitely large number of 
cases, when we take a finite number such as 1000 cases we must 
expect some deviation from normality. Hence one can never 
interpret the standard deviation exactly as if the data were 
normal. We get approximations only, and the closeness of the 
approximation depends on the closeness with which the normal 
curve describes the data. When, however, the data are approximately
normal, we can interpret the standard deviation with a
fair degree of exactitude. 
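
The penny illustration lends itself to a short experiment. The following sketch assumes Python 3.8 or later (for `math.comb`); the seed and the variable names are arbitrary choices of ours. It first computes the frequencies expected in 32 throws of five pennies and then simulates one such set of 32 throws, which will ordinarily depart somewhat from the expected figures.

```python
import math
import random

THROWS = 32
PENNIES = 5

# Expected number of throws showing 0, 1, ..., 5 heads.
expected = [THROWS * math.comb(PENNIES, heads) / 2 ** PENNIES
            for heads in range(PENNIES + 1)]

# One simulated set of 32 throws (the seed is arbitrary, chosen only so
# that the run can be repeated).
rng = random.Random(0)
observed = [0] * (PENNIES + 1)
for _ in range(THROWS):
    heads = sum(rng.random() < 0.5 for _ in range(PENNIES))
    observed[heads] += 1

print("expected:", expected)
print("observed:", observed)
```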

7.7. Preliminary Tests for Normality.—How can we discover
whether or not a curve is approximately normal? There are 
many methods. We can group the data in a frequency table and 
see whether or not there tend to be large frequencies in the central 
classes and small frequencies in the end classes. We can plot 
the data in a frequency curve and see whether it looks roughly 
like the normal curve shown in Fig. 7.2, page 106. We can see 
if the description of the normal curve given on page 163 seems to 
fit the data. We can investigate to learn if about 68 per cent
of the cases are included within 1σ of the mean. We can see if Q is
approximately two-thirds of σ. Or perhaps even better, we can plot
the ogive of the data on a special sort of graph paper called 
“probability paper” to see if it “straightens out.” We noticed 
in Sec. 3.14 that the graph of an ogive assumes a typical S-shape 
when the data are normally distributed. This characteristic 
S-shape appears in Fig. 5.2, page 97. Yet the S-shape indicates 
only that the original frequency distribution was mound-shaped, 
and not necessarily that it was normally distributed. If we 
convert the data of our ogive into percentage form, and plot 
them on probability paper, the ogive will turn into a straight
line if, and only if, the distribution was normal; and if the
distribution is almost, but not quite, normal, the ogive on
probability paper will fall almost into a straight line.

1 See p. 161.

The first step in the use of probability paper is to compute 
the data for a percentage ogive. These computations appear, 
starting with the data on student heights, in Table 7.3. The 


Table 7.3.—Data Put in Form for Probability Paper

   Height         Number of     Number with      Percentage
(centimeters)     Students        Greater       with Greater
(class limit)                     Heights          Heights

    154.5              4            1000            100.0
    157.5              8             996             99.6
    160.5             26             988             98.8
    163.5             53             962             96.2
    166.5             89             909             90.9
    169.5            146             820             82.0
    172.5            188             674             67.4
    175.5            181             486             48.6
    178.5            125             305             30.5
    181.5             92             180             18.0
    184.5             60              88              8.8
    187.5             22              28              2.8
    190.5              4               6              0.6
    193.5              1               2              0.2
    196.5              1               1              0.1
    199.5              0               0              0.0



first two columns are those of Table 5.5, page 94, except that 
the first column lists the lower class limits rather than the class 
marks. The third column, found by adding the items in the 
second column, shows the number of students who had heights 
greater than those listed in the first column. It will be noticed 
that our figures in this column start off with 1000, the total 
number of students, since all the students had heights greater 
than 154.4 cm. Since, however, there were 4 students with 
heights between 154.5 and 157.5 cm., there were only 996 whose 
heights were greater than 157.5 cm. Starting at the bottom of 
this third column, we find that no one had a height greater than 
199.5 cm., the actual upper limit of the tallest class. But there 
was one man whose height lay between 196.5 and 199.5, so we list 
one person taller than 196.5 cm. in the third column. There was
also one person whose height was between 193.5 and 196.5 cm.,
so we have two people taller than 193.5 cm. Any figure in the 
third column can be found by adding to the number at the left in 
the second column all the numbers farther down in the second 
column. The fourth column is found by dividing the third
column by the number of cases and then multiplying by 100. In
this case, since the total number of cases is 1000, this can be done
easily by pointing off one place.

Fig. 7.5. Ogive plotted on probability paper, showing a distribution which
is nearly normal. (Horizontal axis: height, centimeters.)
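
The computations of Table 7.3, and the effect of plotting them on probability paper, can be sketched as follows. This is an illustration only, assuming Python 3.8 or later. Probability paper spaces its percentage rulings according to the normal curve, so plotting on it is equivalent to converting each percentage to its normal deviate and plotting that against height on ordinary paper. As a rough numerical stand-in for judging by eye whether the points "straighten out," the sketch computes the correlation between height and deviate; that yardstick is our own choice, not the text's, and a value near -1 indicates points lying nearly on a falling straight line.

```python
import math
from statistics import NormalDist

lower_limits = [154.5 + 3 * i for i in range(16)]
frequencies = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1, 0]

total = sum(frequencies)
# Number with greater heights: the class frequency plus all those farther down.
greater = [sum(frequencies[i:]) for i in range(len(frequencies))]
percent_greater = [100 * g / total for g in greater]

# Convert percentages to normal deviates. The 0 and 100 per cent points lie
# infinitely far from the center of probability paper and are left out.
unit = NormalDist()
points = [(height, unit.inv_cdf(pct / 100))
          for height, pct in zip(lower_limits, percent_greater)
          if 0 < pct < 100]


def correlation(pairs):
    """Product-moment correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


r = correlation(points)
print(f"points plotted: {len(points)}, correlation: {r:.4f}")
```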

Now we transfer the data of Table 7.3 to probability paper, as 
in Fig. 7.5. The vertical lines are evenly spaced, but the horizontal
lines are bunched closely together in the center and spread
farther apart toward the top and the bottom. When we put the 
data of Table 7.3 on the chart, we find that the points fall almost, 
although not exactly, along a straight line. Thus we know that 
the students’ heights were distributed almost, but not exactly, 
in a normal curve.

Table 7.4.—Basic Data for Use in Constructing Probability Paper

    Line          Units from          Line            Units from
   Number          50% Line          Number            50% Line

     50                0          15.5 or 84.5           1015
  49 or 51            25          15.0 or 85.0           1036
  48 or 52            50          14.5 or 85.5           1058
  47 or 53            75          14.0 or 86.0           1080
  46 or 54           100          13.5 or 86.5           1103
  45 or 55           126          13.0 or 87.0           1126
  44 or 56           151          12.5 or 87.5           1150
  43 or 57           176          12.0 or 88.0           1175
  42 or 58           202          11.5 or 88.5           1200
  41 or 59           228          11.0 or 89.0           1227
  40 or 60           253          10.5 or 89.5           1254
  39 or 61           279          10.0 or 90.0           1282
  38 or 62           305           9.5 or 90.5           1311
  37 or 63           332           9.0 or 91.0           1341
  36 or 64           358           8.5 or 91.5           1372
  35 or 65           385           8.0 or 92.0           1405
  34 or 66           412           7.5 or 92.5           1440
  33 or 67           440           7.0 or 93.0           1476
  32 or 68           468           6.5 or 93.5           1514
  31 or 69           496           6.0 or 94.0           1555
  30 or 70           524           5.5 or 94.5           1598
  29 or 71           553           5.0 or 95.0           1645
  28 or 72           583           4.5 or 95.5           1695
  27 or 73           613           4.0 or 96.0           1751
  26 or 74           643           3.5 or 96.5           1812
  25 or 75           674           3.0 or 97.0           1881
  24 or 76           706           2.5 or 97.5           1960
  23 or 77           739           2.0 or 98.0           2054
  22 or 78           772           1.5 or 98.5           2170
  21 or 79           806           1.0 or 99.0           2326
  20 or 80           842           0.8 or 99.2           2409
  19 or 81           878           0.6 or 99.4           2512
  18 or 82           915           0.4 or 99.6           2652
  17 or 83           954           0.2 or 99.8           2878
  16 or 84           994           0.1 or 99.9           3090

Probability paper can be purchased from some dealers in draftsmen's
supplies, but it is easy to make, and since most stores do not carry it, it may 
be worth while to give here the directions for making it. From the sample 
in Fig. 7.5, we see that the first thing to do is to lay out the required number 
of vertical lines, spacing them at convenient equal intervals. Next we 
locate the line marked 50 per cent, which is at the center of the vertical 
lines. The other lines are arranged symmetrically around this center line. 
That is, the distance to the line marked 30 per cent is the same as the distance 
to the line marked 70 per cent. The distances are given in Table 7.4. The 
first column of this table shows the line in question. The second column 
shows how many units the given line lies above or below the 50 per cent line. 
For example, the line which represents 75 per cent (and also the line which 
represents 25 per cent) lies 674 units from the center. These units are 
entirely arbitrary. Suppose, for example, that we are laying out a piece of 
probability paper on an ordinary sheet of 8½- by 11-in. notebook paper.
We might lay off our vertical lines at half-inch intervals. We might then
decide that we wanted our horizontal lines to cover a distance of, say, 7 in.
out of the total of 11 in.; that is, from the top horizontal line to the bottom
horizontal line is to be 7 in. The 50 per cent line will be established first,
exactly in the center, or 3½ in. from either the top or the bottom. Since
the normal curve runs limitless distances either side of the center, we cannot
show it all. Suppose we decide to show 98 per cent of the cases, from 1 to
99 per cent, leaving off the two extremes. Then we know that the 3½ in.
from our 50 per cent line to the bottom (or to the top) represents the distance
to the 1 per cent (or the 99 per cent) line. We look in Table 7.4 and see that
these lines lie 2326 units from the center. In other words, we let 3½ in.
represent 2326 units. Now if we want to find the 70 per cent line we note
that it lies 524 units from the center, or 524/2326 as far as the 99 per cent
line, or 524/2326ths of 3½ in., or 0.79 in. from the center. Other lines are
found similarly. We decide first what line we shall select for our top (or 
bottom) line—whether 90, 95, or 98 per cent, etc. We note in the table 
how many units it is from the center. Then we locate the other lines by 
proportion. In this way we lay out whatever horizontal lines we want, 
keeping in mind the size of the sheet of paper on which we are working. 
The student can verify the fact that in the illustrative case just given the
90 per cent line (and the 10 per cent line) will be 1.93 in. from the center. 
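
The units of Table 7.4 are simply normal deviates multiplied by 1000, so the layout of a sheet can be computed instead of read from the table. The sketch below assumes Python 3.8 or later; the function names and the default arguments (a 99 per cent edge line and 3½ in. from the center to the edge, as in the illustration above) are ours.

```python
from statistics import NormalDist

unit = NormalDist()


def units_from_center(percent):
    """Distance of a percentage line from the 50 per cent line, in the
    arbitrary units of Table 7.4 (1000 times the normal deviate)."""
    return abs(1000 * unit.inv_cdf(percent / 100))


def inches_from_center(percent, edge_percent=99, half_height_in=3.5):
    """Position of a percentage line by proportion, when the edge line
    lies half_height_in inches from the 50 per cent line."""
    return half_height_in * units_from_center(percent) / units_from_center(edge_percent)


print(round(units_from_center(75)), round(units_from_center(99)))
print(f"70 per cent line: {inches_from_center(70):.2f} in. from the center")
print(f"90 per cent line: {inches_from_center(90):.2f} in. from the center")
```

Changing `edge_percent` or `half_height_in` gives the layout for a sheet of any other size.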

One of the best criteria is to compute the normal curve which 
corresponds to the data themselves and to note how it agrees with 
the data. We have already noted that if we know the number of 
cases, the mean, and the standard deviation, we can find the 
normal curve. We can determine these constants from our 
original data and “fit” a normal curve to them. Let us try this 
procedure in the problem of the heights of Harvard students. 

The normal curve can be fitted by either of two methods. In 
both we make use of tables that describe a normal curve which has
a standard deviation of one class interval and in which the number
of cases is one, that is, in the curve described in the tables

N = 1
σ = 1 class interval

This is called a unit normal curve. We find from the tables (see
Appendix, pages 514 and 515) the values for such a curve, and
then convert these values to fit our particular problem. The
procedure will be clearer if we work out examples. We shall use
first the method of ordinates and then the method of areas.

Fig. 7.6. Numbers of Harvard students between the ages of eighteen and
twenty-five years with various heights, 1914-1916.

7.8. Fitting the Normal Curve: Method of Ordinates.—In any 
actual problem we should ordinarily start by plotting our data in 
a frequency curve, letting the ordinates represent frequencies 
and the abscissas represent the values of X. If we take the case
of student heights, for example, the abscissas will represent 
heights. Such a diagram appears in Fig. 7.6. It will be noted 
that this diagram does look roughly like the normal curve. 

Our next step would be to compute the necessary constants: 
the mean, the standard deviation, and the number of cases.
These we have already computed for other reasons in the case of
students' heights. They are

N = Σf = 1000
X̄ = 175.335 cm. (see page 88)
σ = 6.582 cm. (see page 143)

The two latter can be rounded off to 175.3 and 6.6. 

Since the normal curve is symmetrical, we know that the arithmetic
mean, the median, and the mode will coincide; that is, the
highest point on the curve (the mode) will be located at that point 
on the horizontal axis which represents 175.3 cm. (the mean). 
This tells us that the maximum ordinate will be located at 175.3 
on the horizontal scale, but it does not tell us how high the highest 
point will be. The maximum ordinate (greatest height) of a 
normal curve can always be found from the equation 

$$y_0 = \frac{0.3989\,(Ci)(N)}{\sigma}$$

where y₀ is the magnitude of the maximum ordinate, Ci the class
interval, etc. 1 If we substitute the values of our problem, we get

$$y_0 = \frac{(0.3989)(3)(1000)}{6.6} = 181.3$$


In other words, our normal curve will reach its maximum height
at a point opposite 175.3 on the horizontal scale, and that maximum
height will correspond to 181.3 cases on the vertical scale.

1 We have seen (p. 100) that the formula for the normal curve is

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}}$$

We want the value of y when x = 0 (since x is the deviation from the mean
and we want the value of y at the mean). But if we put x = 0 in this
formula, we find that

$$e^{-x^{2}/2\sigma^{2}} = e^{0} = 1$$

Therefore at this particular point (the mode or the mean) the formula
becomes

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}} = \frac{0.3989\,N}{\sigma}$$

Since the deviations from the mean (x) in this equation are in terms of the
class interval (Ci), we must multiply our result by the class interval to get
an answer in the units of our original problem. This gives the value of y₀
given above.

We could compute the heights of other points on the curve by
further solutions of the general formula, 1 but the use of tables will
save us considerable time. These tables show the height of the
unit normal curve at various distances from the mean, and also
the percentage of the total area under the curve which lies between
a perpendicular erected at the mean and a perpendicular erected
at various distances from the mean. We make our computations
for the unit normal curve, and then convert to the units of our own
problem. The necessary tables are given in Appendixes I and
II, pages 514 and 515.

Reference to the table of ordinates will show, for example, that
the unit normal curve has a height of 0.1295 at a distance of 1.5σ
from the mean. In any particular problem we find the height in
the original units by multiplying the tabular value by (N)(Ci)/σ.
In the problem we have studied, this means that we must multiply
any tabular value by (1000)(3)/6.6 = 454.5. At 1.5σ from
the mean, therefore, the height will be 0.1295(454.5) = 58.9.

1 Let us illustrate for one additional case. If we take the general formula
for the curve, and let N = 1 and σ = 1 as in the unit normal curve, our
formula becomes

$$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2} = 0.3989\, e^{-x^{2}/2}$$

What is the height of the curve at a point one standard deviation from the
mean; that is, when x = 1? Our formula then becomes

$$y = 0.3989\, e^{-1/2} = \frac{0.3989}{\sqrt{e}} = \frac{0.3989}{\sqrt{2.71828}} = 0.242$$

This tells us that the height of the curve at a distance of one standard deviation
from the mean is 0.242 if N = 1, σ = 1, and Ci = 1. In any particular
problem we must multiply our answer by (N)(Ci)/σ. Here this is

$$\frac{(1000)(3)}{6.6} = 454.5$$

Carrying out the multiplication, we have 454.5(0.242) = 110. At one
standard deviation from the mean the height is 110 cases.

Usually we are interested in the height of the curve at particular
points: the class mid-points. These are the points for which the
frequencies are known in the original problem, and we should like
to know the theoretical frequencies at these points so that we can
compare them. We find these easily, and the computations are
summarized in Table 7.5 shown below. The first three
columns of this table are taken directly from Table 6.4, page 140,
the values in the third column being rounded off. The third
column represents the distance of each class mark from the mean, 
and is found by subtracting the value of the mean (175.3) from 


Table 7.5.—Fitting the Normal Curve by Ordinates

Class Mark    Observed     Deviation     Deviation     Tabular     Computed
              Frequency    from Mean     in σ Units     Value      Frequency
   (X)           (f)          (x)          (x/σ)                      (f)

   156            4          -19.3         -2.92        0.0056         2.5
   159            8          -16.3         -2.47        0.0189         8.6
   162           26          -13.3         -2.02        0.0519        23.6
   165           53          -10.3         -1.56        0.1182        53.7
   168           89           -7.3         -1.11        0.2155        97.9
   171          146           -4.3         -0.65        0.3230       146.8
   174          188           -1.3         -0.20        0.3910       177.7
   177          181            1.7          0.26        0.3857       175.3
   180          125            4.7          0.71        0.3101       141.0
   183           92            7.7          1.17        0.2012        91.4
   186           60           10.7          1.62        0.1074        48.8
   189           22           13.7          2.08        0.0459        20.9
   192            4           16.7          2.53        0.0163         7.4
   195            1           19.7          2.98        0.0047         2.1
   198            1           22.7          3.44        0.0011         0.5


each of the figures of the first column. The fourth column states 
these distances in units of the standard deviation. Since the 
standard deviation in this problem is 6.6, we get the figures of the 
fourth column by dividing each figure of the third column by 
6.6. The fourth column, then, tells us how many standard 
deviations from the mean each class mark lies. We now look 
in the table of ordinates of the unit normal curve (Appendix II, 
page 515) and find the height of this curve at each of these deviations.
These values appear in the fifth column of the table. For
example, at a distance of 2.02σ from the mean the unit curve has a
height of 0.0519. 

Finally we must convert the data of the unit normal curve into 
the units of our own problem. As we have just discovered,
this is done by multiplying each ordinate of the unit curve by
(N)(Ci)/σ. In our problem this means that each figure must be
multiplied by (1000)(3)/6.6, or 454.5. The products are entered
in the last column of the table. Thus each figure in the last column
is the product found by multiplying the corresponding
figure in the preceding column by 454.5.

It is now possible to compare the frequencies which actually 
did occur (in the second column) with those which would have 
occurred in a corresponding normal distribution. The figures in 
the last column are those that would be found in a normal distribution
whose mean was 175.3, whose standard deviation was
6.6, and which had 1000 cases. In other words, if we had an 
exactly normal distribution with mean, dispersion, and number 
of cases the same as those in our actual distribution, the cases 
would be distributed as they are in the last column of the table. 
The cases actually were distributed as in the second column. In 
the class from 187.5 to 190.49 (the class with a mid-point of 189) 
we did get 22 cases; in a normal distribution we should expect 20.9 
cases. 1 Similar comparisons may be made at other points.
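
The whole of the method of ordinates can be sketched in a few lines. What follows is an illustration, not the text's own procedure in every detail: it assumes Python 3, computes each ordinate of the unit normal curve from the formula instead of reading it from Appendix II, and does not round the deviations to two decimal places, so the computed frequencies may differ from those of Table 7.5 by a few tenths. The names used are ours.

```python
import math

N = 1000        # number of cases
CI = 3          # class interval, cm.
MEAN = 175.3    # arithmetic mean, cm. (rounded as in the text)
SD = 6.6        # standard deviation, cm. (rounded as in the text)

class_marks = [156 + CI * i for i in range(15)]
observed = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]


def unit_ordinate(z):
    """Height of the unit normal curve at z standard deviations from the mean."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


scale = N * CI / SD                  # the multiplier (N)(Ci)/sigma
y0 = unit_ordinate(0.0) * scale      # maximum ordinate, at the mean

computed = [unit_ordinate((x - MEAN) / SD) * scale for x in class_marks]

print(f"multiplier = {scale:.1f}, maximum ordinate = {y0:.1f}")
for x, f, fc in zip(class_marks, observed, computed):
    print(f"{x:5d} {f:5d} {fc:8.1f}")
```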

7.9. Fitting the Normal Curve: Method of Areas.—We shall 
now fit the normal curve to the same data by the alternative 
method of areas. In this method we start with the lower limits of 
our classes rather than with the mid-points. As we discovered 
in Chap. III, if the class interval is 3 and the mid-point is 156, the
lower limit of the class will be 154.5. Similarly we find the lower 
limits of the other classes. These are entered as the first column 
of our summary table. 

The figures in the second column are the actual frequencies 
with which heights occurred in the various classes. In the third 
column are given the distances of each class lower limit from the 
mean. These are found, of course, by subtracting 175.3 (the 
value of the mean) from each of the figures in the first column. 
In the fourth column these distances are expressed in terms of the 
standard deviation; that is, each entry in the fourth column is 

1 The student may find the decimals of the last column confusing. What 
do we mean when we say that we should expect 20.9 students in a particular 
class? Actually if the heights were normally distributed and we took many 
groups of 1000 students, we should sometimes find 19 students in this height 
group, sometimes 20, sometimes 21, etc. On the average we should find 20.9 
students in the class.

found by dividing the corresponding figure in the third column
by the standard deviation (6.6).

The figures in the fifth column are derived from those in the
tables of areas under the unit normal curve (see Appendix I, 
page 514). This table shows the percentage of the total area 
under a unit normal curve (that is, the percentage of the total 
number of cases in such a normal distribution) which lies between 
a perpendicular erected at the mean and another perpendicular 


Table 7.6.—Fitting the Normal Curve by Areas

Lower Class    Observed     Deviation     Deviation      Per cent of Total Area
   Limit       Frequency    from Mean     in σ Units     Below This     In This
                  (f)                                       Class        Class

   154.5           4          -20.8         -3.15            0.1           0.3
   157.5           8          -17.8         -2.70            0.4           0.8
   160.5          26          -14.8         -2.24            1.2           2.5
   163.5          53          -11.8         -1.79            3.7           5.5
   166.5          89           -8.8         -1.33            9.2           9.5
   169.5         146           -5.8         -0.88           18.7          15.0
   172.5         188           -2.8         -0.42           33.7          17.5
   175.5         181            0.2          0.03           51.2          17.2
   178.5         125            3.2          0.48           68.4          14.2
   181.5          92            6.2          0.94           82.6           9.2
   184.5          60            9.2          1.39           91.8           5.0
   187.5          22           12.2          1.85           96.8           2.1
   190.5           4           15.2          2.30           98.9           0.8
   193.5           1           18.2          2.76           99.7           0.2
   196.5           1           21.2          3.21           99.9           0.1
   199.5           0           24.2          3.67          100.0




erected at any given number of standard deviations from the 
mean. We discover from this table that 49.9 per cent of the area
lies between the mean and a perpendicular erected 3.15σ below the
mean. Always, of course, 50 per cent of the area lies above the 
mean (since the normal curve is symmetrical). Therefore a total 
of 49.9 per cent + 50 per cent = 99.9 per cent lies above this class 
limit. We conclude that 0.1 per cent must lie below this class 
limit. This is the first figure in the fifth column. 

Similarly our table tells us that 40.8 per cent of the area lies
between the mean and a point 1.33σ below the mean. Therefore
90.8 per cent of the area (50 per cent + 40.8 per cent = 90.8 per
cent) must lie above this point and 9.2 per cent must lie below it.
This gives us the fifth figure in the fifth column. When we come
to the tenth figure in the column, we find that we are 0.94σ above
the mean. The table in Appendix I tells us that 32.6 per cent 
of the area lies between this point and the mean. Since another 
50 per cent lies below the mean, we know that 82.6 per cent of the 
total area (50 per cent + 32.6 per cent = 82.6 per cent) lies 
below the lower limit of this class. This is the tenth entry in the 
fifth column. The other figures in the column are found similarly. 


Table 7.7.—Actual Distribution of Students’ Heights Compared with
Estimates of the Corresponding Normal Distribution as Computed
by the Method of Ordinates and by the Method of Areas

Class Mark    Actual Number    Number Expected by Method of:
                Observed         Ordinates         Areas

   156              4                2.5              3
   159              8                8.6              8
   162             26               23.6             25
   165             53               53.7             55
   168             89               97.9             95
   171            146              146.8            150
   174            188              177.7            175
   177            181              175.3            172
   180            125              141.0            142
   183             92               91.4             92
   186             60               48.8             50
   189             22               20.9             21
   192              4                7.4              8
   195              1                2.1              2
   198              1                0.5              1



Each figure in column 6 is found by subtracting the corresponding
figure in column 5 from the figure below it. The column is a
column of differences. For example, if we subtract the first figure
in column 5 (0.1) from the second figure (0.4), we get the first
figure in column 6 (0.3). If we subtract the seventh figure in
column 5 from the eighth, we get the seventh figure in column 6
(51.2 − 33.7 = 17.5). The reason for this taking of differences
is evident after a moment’s thought. The first figure in column 5
tells us that 0.1 per cent of the total area lies below the bottom of
the first class. The next figure tells us that 0.4 per cent lies
below the bottom of the second class. The difference, 0.3 per
cent, must lie in the first class. Similarly, if 33.7 per cent of the
area lies below the bottom of the seventh class and 51.2 per cent
below the bottom of the eighth class, the difference, or 17.5 per
cent of the area, must lie in the seventh class.

Fig. 7.7. Distribution of the heights of 1000 Harvard students, showing the
actual and the normal distributions. The smooth curve drawn with a
broken line is the normal curve fitted by the method of ordinates.

Since the total number of cases is 1000, it is easy to convert
this last column to actual numbers of cases expected. In any
such problem we should find the percentages of the total number
of cases as listed in the sixth column. In our problem the
expected numbers of cases are 3, 8, 25, 55, etc. These expected
frequencies will be compared with the actual frequencies shown in
the second column. It is evident that the results obtained by the
method of areas and those obtained by the method of ordinates
are not exactly identical. The two methods seldom show entire
agreement. We can compare the actual frequencies with which
various heights occurred with those which would have occurred in
a normal distribution by arranging the actual figures and the
estimates in parallel columns, as shown in Table 7.7 on page 183.
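
The whole method of areas can be sketched in a few lines. The sketch assumes the mean of 175.3 cm., the standard deviation of 6.6 cm., and the class limits 154.5, 157.5, ..., 199.5 used in the text; it evaluates the normal integral with `math.erf` instead of reading a two-decimal table, so individual classes may differ by a few cases from the rounded figures of Table 7.7. The variable names are ours.

```python
import math

MEAN, SIGMA, N = 175.3, 6.6, 1000               # figures used in the text
limits = [154.5 + 3.0 * k for k in range(16)]   # lower class limits, plus the top limit

def fraction_below(x):
    """Fraction of the normal-curve area lying below the height x."""
    z = (x - MEAN) / SIGMA
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Column 5: cumulative area below each limit. Column 6: successive differences.
cumulative = [fraction_below(x) for x in limits]
expected = [N * (upper - lower) for lower, upper in zip(cumulative, cumulative[1:])]

for lower_limit, cases in zip(limits, expected):
    print(f"class mark {lower_limit + 1.5:5.0f}: {cases:6.1f} cases expected")
```

The expected frequencies sum to slightly less than 1000 because a small part of the area lies outside the extreme class limits.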






It is obvious that the method of areas might also have been 
used to estimate to the nearest tenth of a case, as was the method 
of ordinates. It would have been necessary only to take our 
figures from the Appendix tables to one more place. However, as 
we noted in Chap. II, the adding of decimal places would give us 
only seeming increases in accuracy. Figure 7.7 shows graphically 
the agreement between the actual frequencies and those expected 
when we estimate by the method of ordinates. 

We have now discovered a number of ways in which data 
can be tested to see whether or not they are approximately 
normal in their distribution. In the next chapter we shall study 
further, more precise, and more accurate tests for normality, as 
well as learning something about other types of frequency curves 
which differ in their characteristics from the normal curve. At 
the end of that chapter will be found suggestions for further 
reading on the subject of frequency curves in general, and the 
references there given (see page 229) can be used by the reader 
who wishes to pursue further the work of this chapter as well. 

EXERCISES 

1. Give two original examples of statistical probability, of a priori 
probability. 

2. A baseball player has made 28 hits in 117 times at bat. How many 
hits is he expected to get in his next 25 times at bat? What are the chances 
that he will make 10 or more? 3 or less? 

3. What is the probability of drawing 3 hearts in succession from a pack
of 52 cards, each card drawn being reinserted and the pack being shuffled
before the next draw?

4. What is the probability that we shall get exactly 4 heads in 6 throws
of a penny? that we shall get 4 heads or more?

5. Using the scheme used on page 161 to show the results of throwing
5 pennies, diagram the possible results when 6 pennies are thrown. Compare
your result with that shown in Table 7.1, page 163.

6. Continue Table 7.1, page 163, as it would be if two more rows were
added at the bottom.

7. Which is the more general term: “point binomial” or “normal curve”?
Distinguish between them.

8. We erect a perpendicular from the base line to a normal curve at a
point $1.7\sigma$ above the mean, and another $0.6\sigma$ below the mean. What per
cent of the total area under the curve is within the space bounded by the
two perpendiculars, the base line, and the curve? What per cent of the
area is above the upper perpendicular?

9. What odds would you offer that a Harvard student chosen at random
will be between 185 cm. and 170 cm. tall? Use the results discovered in the
text as a basis for your answer.

10. Three students take different tests. A gets a score of 72, B of 85,
and C of 17. The average marks received on the three tests are 85, 90, and
25, respectively. The three standard deviations are 7, 2, and 7, respectively.
Arrange the three students in order of excellence as you would judge them
by these results.

11. In a normal distribution $\bar{X} = 17$ and $\sigma = 3$. What are the values of
$Q$, AD, $Q_1$, $Q_3$, Mo., and Med.?

12. Fit a normal curve to the data given in Exercise 3, page 124. 

13. Fifty-three of 625 cases of diphtheria in Providence, Rhode Island, in
1915 resulted in death.¹ Let us suppose that this ratio of deaths to total
cases is correct for the universe of diphtheria cases. Suppose we have
300 cases of diphtheria in an epidemic. What are the chances that there
will be as many as 27 deaths? If there were 60 deaths, what would you
conclude? If there were 15 deaths? If you knew that in the universe
one would get an average of 53 deaths in 625 cases, how many deaths could
occur in an epidemic of 300 cases before you would rule out chance as the
cause of the increased fatalities and decide that the cases must be
fundamentally different from those of the universe mentioned?

14. An instructor’s records show that he has, in the past, turned in failing
grades for 12 of 140 students in elementary statistics. His present class
numbers 20. How likely is it that every member of the present class will
pass? That as many as 4 will fail? If you are sixth from the bottom
of the present class, and if this class is comparable to past ones, what are
the chances that you will fail?

15. Make a sheet of probability paper on a sheet of 8½- by 11-in. notebook
paper. Put the 98 per cent line at the top and the 2 per cent line
at the bottom. Let the distance between these two lines be 10 in. Put
in the following horizontal lines: 95, 90, 85, 80, 75, 70, 60, 50, 40, 30, 25, 20,
15, 10, and 5 per cent.

16. Plot on the probability paper made in the preceding exercise, or on a 
piece furnished, the data of Table 5.9, page 124. Are the wages normally 
distributed? 

17. In Appendix II is a table showing the height of the normal curve 
at various distances from the mean. Using the data in this table, draw on a 
sheet of graph paper a picture of the normal curve. Locate the heights of 
the curve at each fifth of a standard deviation from the center, and connect 
them by a smooth, freehand curve. 

18. Find the median and the quartiles of the heights of students from the 
diagram of Fig. 7.5, page 174. Compare these answers which were found 
graphically with the computed answers on pages 95 and 129. 

19. Show from their formulas that the standard deviation of a probability 
distribution is always less than the square root of the arithmetic mean 
except in the limiting case where it may be as large as the square root of the 
arithmetic mean. 

20. Show that $q = \sigma^2/\bar{X}$.

¹ G. C. Whipple, “Vital Statistics,” p. 377, John Wiley & Sons, Inc.,
New York, 1923.



CHAPTER VIII 


MOMENTS, FREQUENCY CURVES, AND THE 
CHI-SQUARE TEST 

In the preceding chapter we have studied what is, perhaps, the 
most common and most useful of all frequency curves—the 
so-called “normal curve.” We have discovered that this 
normal curve can be described completely in terms of the number 
of cases, the arithmetic mean, and the standard deviation. If 
we are told these three things we can draw the curve, or we can 
tell what number of cases will fall within any given area under 
the curve. 

So much emphasis is given to the normal curve that the student
sometimes draws the conclusion that it is the only frequency
curve there is, or, at least, the only important one. We shall see
in this chapter, however, that there are many other important
frequency curves, and we shall learn that in order to identify and
to describe them we usually need some information in addition
to the values of $N$, $\bar{X}$, and $\sigma$. While the normal curve can be
used to describe with reasonable accuracy a good many distributions,
the statistician soon learns that there are also many
distributions which differ in character so much from the normal
that a normal curve fitted to the data would be misleading
rather than informative. In this chapter, we shall study some
of these non-normal curves.

8.1. The Higher Moments of a Frequency Distribution.—We
have used the symbol $x$ to represent the deviation of any item
in a distribution from the arithmetic average of that distribution.
We find the value of $x$ by subtracting the value of the arithmetic
mean from the value of the item (see pages 131 to 132). The
arithmetic means of the various powers of these deviations in
any distribution are called the moments of the distribution. If
we take the mean of the first power of the deviations, we get
the first moment about the mean; the mean of the squares of the
deviations gives us the second moment about the mean; the
mean of the cubes of the deviations yields the third moment
about the mean; etc. These moments can be defined by the
following formulas if we let $\nu_1$ represent the first moment about
the mean, $\nu_2$ the second moment, etc.:

$$
\nu_1 = \frac{\Sigma x}{n} \qquad
\nu_2 = \frac{\Sigma x^2}{n} \qquad
\nu_3 = \frac{\Sigma x^3}{n} \qquad
\nu_4 = \frac{\Sigma x^4}{n} \qquad \text{etc.}
$$

The 0th moment about the mean will, of course, be equal to
$\Sigma x^0/n$. But as in the case of each item, regardless of the amount
of the deviation, the 0th power of the deviation will equal 1,
this is equivalent to $n/n = 1$. In any distribution, then, the 0th
moment equals 1. We have discovered also that $\sigma = \sqrt{\Sigma(x^2)/n}$
(page 137). But this is the square root of the second moment
about the mean, as will be seen from the formula above. We can
thus say that the second moment about the mean $= \sigma^2$. This
will be true of any distribution.

When we were studying the average deviation we discovered
that in any distribution the sum of the deviations of the items
from the mean was equal to zero (page 131). But it will be noted
that the formula for the first moment about the mean involves
$\Sigma x$ and that it reduces to zero, since $\Sigma x = 0$. We can thus
say some things about the moments of all curves in advance:

(1) $\nu_0 = 1$

(2) $\nu_1 = 0$

(3) $\nu_2 = \sigma^2$

There are one or two other things that can be deduced about
the moments. If the curve is symmetrical there will be a deviation
below the mean which exactly equals each deviation above
the mean. This is what we mean by symmetry. If this is true,
then these positive deviations and negative deviations will exactly
balance each other and when added will cancel out. Of course, if
the deviations are raised to even powers their signs will all be
positive, and they will no longer cancel out. But the sums of the
odd powers will all be equal to zero on account of the cancellations.
We thus know in advance that in any symmetrical curve
the odd moments, being based on the sums of odd powers of
deviations, will equal zero. That is, in symmetrical distributions

$$
\nu_3 = 0 \qquad \nu_5 = 0 \qquad \nu_7 = 0 \qquad \text{etc.}
$$

This does not hold true in asymmetrical distributions. The rules
which we laid down for $\nu_0$, $\nu_1$, and $\nu_2$ hold true for any distribution.
The rules just enunciated for odd-powered moments above the
first hold true only if the distribution is symmetrical. For this
reason we can use them, and do use one of them, as measures of
asymmetry (see page 204).
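
These general rules can be illustrated with a small sketch. The two five-item data sets below are made up for illustration and do not come from the text; the function name `moment_about_mean` is likewise ours.

```python
def moment_about_mean(values, k):
    """k-th moment about the mean: the mean of the k-th powers of the deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetrical distribution
skewed = [1, 2, 3, 4, 10]      # hypothetical asymmetrical distribution

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    moments = [moment_about_mean(data, k) for k in range(5)]
    print(name, moments)
```

In both sets the 0th moment is 1 and the first moment is 0. The third moment is 0 only for the symmetrical set; for the skewed set it is 36.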

8.2. Computation of the Higher Moments.—We could, of
course, compute the higher moments directly from their formulas.
Since the third moment about the mean is $\Sigma x^3/n$, we could find
the deviation of each item from the mean, cube it, add the cubes,
and divide by $n$. Following our earlier practice where data are
grouped in frequency tables we should, in such cases, use the formula
$\Sigma(fx^3)/n$. But, as before, it pays here to use a short method in
which we guess at a mean, take our deviations in units of the
class interval, and carry on our computations, finally adjusting
our results to take care of the difference between our guessed
mean and the true mean. This method has become familiar to us
in computing the mean and the standard deviation (pages 87
and 141), and we shall not go into the details of the theory of it
here. We shall, however, give an example. Still using our
data on the heights of Harvard students, let us compute the third
and the fourth moments about the mean. (Higher moments
would be handled similarly.) The process is illustrated in
Table 8.1.

In adding totals be careful to keep track of signs. The first
five columns of this table are taken directly from Table 6.5, where
we used these same figures in computing the standard deviation.
The computation of the figures in the remaining two columns
is evident. Each figure in column 6 is the product of the corresponding
figures in columns 3 and 5; that is, $(d)(fd^2) = (fd^3)$.
Similarly, each figure in column 7 is the product of the corresponding
figures in columns 3 and 6. Thus $(d)(fd^3) = (fd^4)$.
Had we not computed the mean and the standard deviation, it is
obvious that we could do it directly from the figures given here,
since this is the method heretofore used for their computation.

We cannot compute the moments about the mean directly
from these figures, since these figures show deviations about an
assumed mean. (Here the assumed mean is 177 cm.) Hence
we compute first the moments about the assumed mean. Just
as we symbolize the assumed mean by $\bar{X}'$ instead of by $\bar{X}$, in
order that it may be distinguished from the true mean, so we shall

Table 8.1.—Computation of the Higher Moments: Heights of
Harvard Students

Class mark  Frequency  Class deviation
  ($X$)       ($f$)        ($d$)         $fd$    $fd^2$    $fd^3$    $fd^4$
   156           4          -7            -28      196    -1,372     9,604
   159           8          -6            -48      288    -1,728    10,368
   162          26          -5           -130      650    -3,250    16,250
   165          53          -4           -212      848    -3,392    13,568
   168          89          -3           -267      801    -2,403     7,209
   171         146          -2           -292      584    -1,168     2,336
   174         188          -1           -188      188      -188       188
   177         181           0              0        0         0         0
   180         125           1            125      125       125       125
   183          92           2            184      368       736     1,472
   186          60           3            180      540     1,620     4,860
   189          22           4             88      352     1,408     5,632
   192           4           5             20      100       500     2,500
   195           1           6              6       36       216     1,296
   198           1           7              7       49       343     2,401
Totals       1,000                       -555    5,125    -8,553    77,809


represent the first, second, third, and fourth moments about the
assumed mean by $\nu'_1$, $\nu'_2$, $\nu'_3$, and $\nu'_4$, respectively, in order that
they can be distinguished from the moments about the mean.
The formulas for the moments about the assumed mean follow:

$$
\begin{aligned}
\nu'_1 &= \frac{\Sigma fd}{n} = \frac{-555}{1000} = -0.555 \\
\nu'_2 &= \frac{\Sigma fd^2}{n} = \frac{5125}{1000} = 5.125 \\
\nu'_3 &= \frac{\Sigma fd^3}{n} = \frac{-8553}{1000} = -8.553 \\
\nu'_4 &= \frac{\Sigma fd^4}{n} = \frac{77{,}809}{1000} = 77.809
\end{aligned}
$$

The general formulas appear at the left, and at the right we have
substituted the values found for this particular problem.
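
The column totals of Table 8.1 and the four moments about the assumed mean can be recomputed with a short sketch. It assumes only the class marks, the frequencies, the assumed mean of 177 cm., and the class interval of 3 cm. given in the text; the variable names are ours.

```python
class_marks = list(range(156, 199, 3))     # 156, 159, ..., 198
freqs = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]
ASSUMED_MEAN, CLASS_INTERVAL = 177, 3

n = sum(freqs)
# Class deviations d, in units of the class interval: -7, -6, ..., 7.
d = [(x - ASSUMED_MEAN) // CLASS_INTERVAL for x in class_marks]

# Totals of the fd, fd^2, fd^3, and fd^4 columns.
sums = {k: sum(f * dev ** k for f, dev in zip(freqs, d)) for k in (1, 2, 3, 4)}

# Moments about the assumed mean, still in class-interval units.
nu_prime = {k: total / n for k, total in sums.items()}

print(sums)
print(nu_prime)
```

The totals come out as -555, 5,125, -8,553, and 77,809, as in the bottom row of the table.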

Now comes the problem of shifting from the assumed mean to
the true mean. The formulas for the moments about the mean in
terms of the moments about the assumed mean follow:

$$
\begin{aligned}
\nu_1 &= \frac{C_i(\Sigma fd)}{n} - \frac{C_i(\nu'_1)(n)}{n} = 0 \\
\nu_2 &= C_i^2\left(\nu'_2 - \nu'^2_1\right) \\
\nu_3 &= C_i^3\left(\nu'_3 - 3\nu'_2\nu'_1 + 2\nu'^3_1\right) \\
\nu_4 &= C_i^4\left(\nu'_4 - 4\nu'_3\nu'_1 + 6\nu'_2\nu'^2_1 - 3\nu'^4_1\right)
\end{aligned}
$$

If we substitute in these equations the values of our problem and
solve, we get the following results:

$$
\nu_1 = \frac{3(-555)}{1000} - \frac{3(-0.555)(1000)}{1000} = 0
$$

The first moment about the mean must always equal zero. It is
worth while to substitute the proper values in the equation for $\nu_1$
and solve as a check on the arithmetic, since unless a mistake has
been made the result must equal zero.

$$
\nu_2 = 9(5.125 - 0.308) = 9(4.817) = 43.353
$$

It will be remembered that this is the square of the standard
deviation. Had we not computed $\sigma$ before we should now
compute $\sqrt{43.353} = 6.58 = \sigma$. Compare this with the $\sigma$ found
before on page 143. The value of $\nu_2$ is always the square of $\sigma$.

$$
\begin{aligned}
\nu_3 &= 27[-8.553 - 3(5.125)(-0.555) + 2(-0.555)^3] = -9.774 \\
\nu_4 &= 81[77.809 - 4(-8.553)(-0.555) + 6(5.125)(0.308) - 3(0.094864)] = 5508.567
\end{aligned}
$$

These are the moments of the distribution about the mean. If
we gather together our results relative to the distribution of
students’ heights, we find

$$
\begin{aligned}
\nu_1 &= 0 \\
\nu_2 &= 43.353 \\
\nu_3 &= -9.774 \\
\nu_4 &= 5508.567
\end{aligned}
$$
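
The shift from the assumed mean to the true mean can be sketched as follows. The sketch assumes the four moments about the assumed mean found above and a class interval of 3 cm.; the short names `v1` to `v4` are ours. It carries every intermediate figure at full precision, so its results agree with the printed ones only to within the rounding of the intermediate steps.

```python
CI = 3                                            # class interval, cm.
v1, v2, v3, v4 = -0.555, 5.125, -8.553, 77.809    # moments about the assumed mean

# Moments about the true mean, converted back to centimeters.
nu2 = CI**2 * (v2 - v1**2)
nu3 = CI**3 * (v3 - 3 * v2 * v1 + 2 * v1**3)
nu4 = CI**4 * (v4 - 4 * v3 * v1 + 6 * v2 * v1**2 - 3 * v1**4)

print(f"second moment = {nu2:.3f}")
print(f"third moment  = {nu3:.3f}")
print(f"fourth moment = {nu4:.3f}")
```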


It is obvious that the curve is not exactly symmetrical, for if it
were the value of $\nu_3$ would be zero. It is in fact −9.774. But we
cannot tell whether this is a large or a small deviation from
symmetry merely by the size of $\nu_3$. In the case of $\nu_3$ we are dealing
with the third power of deviations from the average. To
judge the degree of asymmetry we must relate $\nu_3$ to the standard
deviation, and since the deviations are cubed we relate it to the
cube of the standard deviation. Similarly we can relate the
fourth moment to the fourth power of the standard deviation.
The various moments divided by the proper power of the standard
deviation give us another group of useful coefficients which
we represent by the Greek letter $\alpha$ (alpha). We can define them
thus:

$$
\alpha_1 = \frac{\nu_1}{\sigma} = 0 \qquad
\alpha_2 = \frac{\nu_2}{\sigma^2} = 1 \qquad
\alpha_3 = \frac{\nu_3}{\sigma^3} \qquad
\alpha_4 = \frac{\nu_4}{\sigma^4} \qquad \text{etc.}
$$

These measures are read as “alpha one,” “alpha two,” “alpha
three,” etc.

It can be demonstrated¹ that the values of $\alpha_3$ and $\alpha_4$ for the
normal curve are always 0 and 3, respectively. Thus we can test
the curve of students’ heights by computing these constants.
The computation follows:

$$
\begin{aligned}
\alpha_3 &= \frac{\nu_3}{\sigma^3} = \frac{-9.774}{6.6^3} = -0.034 \\
\alpha_4 &= \frac{\nu_4}{\sigma^4} = \frac{5508.567}{6.6^4} = 2.926
\end{aligned}
$$

¹ Rietz et al., “Handbook of Mathematical Statistics,” p. 97, Houghton
Mifflin Company, Boston, 1924.

If we compare these two figures with those which would have
occurred had the heights been normally distributed, we find that
this curve is approximately normal.
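
The two alphas can also be computed directly from the moments. The sketch below assumes the printed moments 43.353, −9.774, and 5508.567 and takes the standard deviation as the unrounded square root of the second moment, so the last decimal place may differ from a hand computation made with a rounded standard deviation.

```python
import math

nu2, nu3, nu4 = 43.353, -9.774, 5508.567   # moments about the mean, from the text

sigma = math.sqrt(nu2)       # standard deviation, about 6.58 cm.
alpha3 = nu3 / sigma**3      # zero for a normal curve
alpha4 = nu4 / sigma**4      # three for a normal curve

print(f"sigma = {sigma:.2f}, alpha3 = {alpha3:.3f}, alpha4 = {alpha4:.2f}")
```

Both results lie close to the normal values of 0 and 3, which is the comparison the text draws.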

8.3. Checking Accuracy of Computations.—We have learned 
earlier (see Secs. 5.4 and 6.8) that it is possible to check the 
accuracy of arithmetical computations by means of what is 


Table 8.2.—Charlier Check for Accuracy of Computations—The
Moments

Class mark  Frequency  Class deviation
  ($X$)       ($f$)      ($d + 1$)     $f(d+1)$  $f(d+1)^2$  $f(d+1)^3$  $f(d+1)^4$
   156           4          -6            -24        144        -864       5,184
   159           8          -5            -40        200      -1,000       5,000
   162          26          -4           -104        416      -1,664       6,656
   165          53          -3           -159        477      -1,431       4,293
   168          89          -2           -178        356        -712       1,424
   171         146          -1           -146        146        -146         146
   174         188           0              0          0           0           0
   177         181           1            181        181         181         181
   180         125           2            250        500       1,000       2,000
   183          92           3            276        828       2,484       7,452
   186          60           4            240        960       3,840      15,360
   189          22           5            110        550       2,750      13,750
   192           4           6             24        144         864       5,184
   195           1           7              7         49         343       2,401
   198           1           8              8         64         512       4,096
Totals       1,000                        445      5,015       6,157      73,127


called the “Charlier check.” This check really consists, as
can be seen by looking back to the earlier examples, in choosing
another guessed mean as a starting point at the class mark of
the next smaller class, so that the values of $d$ are each increased
by unity. We use the same general method when computing the
higher moments, although it is probably easier to go through
the process once, as we did in Table 8.1, and then set up an
entirely new second table with a new arbitrary zero point, as in
Table 8.2. It will be noticed that the first two columns of
Table 8.2 are exact duplicates of the first two columns of Table
8.1, but in the third column each value of $d$ is greater by one than
it was in the earlier table. The zero point is taken in the preceding
class where the class mark is 174 cm. instead of at 177 cm.
as in Table 8.1. We then go through exactly the same processes
which we used in the preceding section. In order to show the
connection between the two tables, we label the third column
$d + 1$, since each number in it is found by adding one to the
corresponding entry in Table 8.1.

We now make use of the following equations, which the
student can easily derive for himself after the fashion of the
derivations in the footnotes on pages 92 and 144.

$$
\begin{aligned}
\Sigma f(d + 1) &= \Sigma(fd) + N \\
\Sigma f(d + 1)^2 &= \Sigma fd^2 + 2\Sigma fd + N \\
\Sigma f(d + 1)^3 &= \Sigma fd^3 + 3\Sigma fd^2 + 3\Sigma fd + N \\
\Sigma f(d + 1)^4 &= \Sigma fd^4 + 4\Sigma fd^3 + 6\Sigma fd^2 + 4\Sigma fd + N
\end{aligned}
$$

The first two of these equations have been used heretofore in
checking our computation of the arithmetic mean and the
standard deviation, but are repeated here so that we may have
all the customary Charlier equations together. The last two
equations are an obvious extension for the third and the fourth
powers of $d + 1$. If we substitute the values from Table 8.2 in
the left-hand members of these equations, and the values from
Table 8.1 in the right-hand members, we get the following:

$$
\begin{aligned}
445 &= -555 + 1000 \\
5015 &= 5125 + 2(-555) + 1000 \\
6157 &= -8553 + 3(5125) + 3(-555) + 1000 \\
73{,}127 &= 77{,}809 + 4(-8553) + 6(5125) + 4(-555) + 1000
\end{aligned}
$$

Since these four equations all check out when we evaluate the
right-hand members, we know that our arithmetical work has
been accurate. We could, of course, compute the values of the
arithmetic mean, standard deviation, and the alphas quite as
well from Table 8.2 as from Table 8.1. The computations in
the tables are very rapid, and hence the check consumes relatively
little time.
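
The four Charlier equations can be verified mechanically. The sketch below assumes the frequencies and class deviations of Table 8.1; the helper `power_sum` is ours and simply forms a column total with every deviation shifted by a chosen amount.

```python
freqs = [4, 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22, 4, 1, 1]
d = list(range(-7, 8))     # class deviations of Table 8.1
n = sum(freqs)

def power_sum(k, shift=0):
    """Sum of f * (d + shift)**k over all classes."""
    return sum(f * (dev + shift) ** k for f, dev in zip(freqs, d))

# Left-hand members: column totals of Table 8.2, where every d is raised by one.
left = [power_sum(k, shift=1) for k in (1, 2, 3, 4)]

# Right-hand members: built only from the column totals of Table 8.1.
right = [
    power_sum(1) + n,
    power_sum(2) + 2 * power_sum(1) + n,
    power_sum(3) + 3 * power_sum(2) + 3 * power_sum(1) + n,
    power_sum(4) + 4 * power_sum(3) + 6 * power_sum(2) + 4 * power_sum(1) + n,
]

print(left)
print(right)
```

Both lists come out as 445, 5015, 6157, and 73,127, which is the agreement the check requires.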

8.4. Grouping Error.—Our computations of the moments have
been carried out under the assumption that all the items in each
class are concentrated at the class mark. We know that this is
ordinarily not actually true, but we have assumed that we need
not worry about it if the number of cases is sufficiently large and
the class interval sufficiently small. With a small class interval
no item can be very far from the class mark; and if in one class
a preponderance of the items fall below the class mark, it is
reasonable to assume that, in another class, a majority of the items
will fall above the class mark, so that the errors will cancel each
other out. The errors will be what we have called compensating
errors (see Sec. 2.2).

When we stop to think about it, however, we realize that while
the errors may cancel themselves out, half being positive and
half negative, the squares of the errors will all be positive, and
will not be compensating. In computing the standard deviation
or $\alpha_4$ or any other statistic based on the second moment or the
fourth moment or any other even-numbered moment, we need to
remember that we have introduced a noncompensating error by
assuming that the values within a class are all equal to the class
mark, and this noncompensating error, being always positive,
gives us values larger than we should really get. The values of
$\sigma$, the variance, and $\alpha_4$ computed from frequency tables are all too
large, and need to be reduced slightly to take care of what we
call grouping error. This is the error which is introduced into
the computation of even-numbered moments by assuming that
all items in each frequency class are equal to the class mark.

Fortunately it is possible to make simple corrections for this
grouping error in those cases where continuous data have been
grouped in frequency tables if the frequency curve tends to
approach the base line gradually and slowly at each end of the
distribution. The corrections are small, and the statistician is
foolish to bother with them if his original figures are rough
approximations. But where we have continuous data with the
characteristics just described, and where the original measurements
are reasonably precise, we may well apply Sheppard’s
corrections to eliminate the grouping error. Since our data on
students’ heights approximate these requirements, we may illustrate
with them the application of these corrections.

The moments that we have computed, which have not been
corrected by Sheppard’s process, are called the crude moments,
to distinguish them from the adjusted moments which we get by
applying Sheppard’s corrections.

The first and third moments need no correction. If we let $\mu_2$
stand for the adjusted second moment and $\mu_4$ for the adjusted
fourth moment, we apply the corrections thus:

$$
\begin{aligned}
\mu_2 &= \nu_2 - \frac{C_i^2}{12} \\
\mu_3 &= \nu_3 \\
\mu_4 &= \nu_4 - \frac{\nu_2 C_i^2}{2} + \frac{7C_i^4}{240}
\end{aligned}
$$

If the class interval is one, the application is obviously simplified.
If we correct the moments by Sheppard’s correction, we use the
corrected moments rather than the crude moments in computing
the values of $\sigma$, $\alpha_3$, and $\alpha_4$.

Applying Sheppard’s corrections to our problem of the distribution
of students’ heights, we get the following adjusted
moments:

$$
\begin{aligned}
\mu_2 &= 43.353 - \frac{9}{12} = 43.353 - 0.75 = 42.603 \\
\mu_3 &= \nu_3 = -9.774 \\
\mu_4 &= 5508.567 - \frac{(43.353)(9)}{2} + \frac{7(81)}{240} = 5315.84
\end{aligned}
$$

If, now, we use these corrected moments in computing $\sigma$, $\alpha_3$, and
$\alpha_4$, we have

$$
\begin{aligned}
\sigma &= \sqrt{\mu_2} = \sqrt{42.603} = 6.53 \\
\alpha_3 &= \frac{\mu_3}{\sigma^3} = \frac{-9.774}{278.4} = -0.035 \\
\alpha_4 &= \frac{\mu_4}{\sigma^4} = \frac{5315.84}{1818.2} = 2.93
\end{aligned}
$$

If the adjusted and the crude results are compared, we have

                  Crude       Adjusted
2d moment        43.353        42.603
4th moment     5508.567       5315.84
$\alpha_3$        -0.034        -0.035
$\alpha_4$         2.926         2.93
$\sigma$           6.58          6.53

It will thus be seen that the corrections bring but minor changes
in the value of the $\alpha$ terms.
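
Sheppard's corrections are easy to apply mechanically. The sketch below assumes the crude moments 43.353, −9.774, and 5508.567 and the class interval of 3 cm. from the text; the variable names are ours, and the standard deviation is kept unrounded when the alphas are formed.

```python
import math

CI = 3                                      # class interval, cm.
nu2, nu3, nu4 = 43.353, -9.774, 5508.567    # crude moments from the text

# Sheppard's corrections: only the even-numbered moments are adjusted.
mu2 = nu2 - CI**2 / 12
mu3 = nu3
mu4 = nu4 - (nu2 * CI**2) / 2 + 7 * CI**4 / 240

sigma = math.sqrt(mu2)
alpha3 = mu3 / sigma**3
alpha4 = mu4 / sigma**4

print(f"adjusted 2d moment  = {mu2:.3f}")
print(f"adjusted 4th moment = {mu4:.2f}")
print(f"sigma = {sigma:.2f}, alpha3 = {alpha3:.3f}, alpha4 = {alpha4:.2f}")
```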

8.5. Moments of Probability Distributions.—We learned in
Sec. 7.2 that the arithmetic mean and standard deviation of
probability distributions could be computed quickly and easily
by means of the formulas

$$
\bar{X} = np \qquad \sigma = \sqrt{npq}
$$

Now that we have studied the higher moments, we can add two
similar useful formulas¹ for the values of the alphas:

$$
\begin{aligned}
\alpha_3 &= \frac{q - p}{\sqrt{npq}} \\
\alpha_4 &= \frac{1}{npq} - \frac{6}{n} + 3 = \frac{1}{\sigma^2} - \frac{6}{n} + 3
\end{aligned}
$$

These formulas will hold for any point binomial distribution
found by evaluating $(q + p)^n$. The student will note that, as
$n$ grows extremely large, the value of $\alpha_3$ approaches zero and the
value of $\alpha_4$ approaches 3. But these are the values in a normal
distribution. Hence we see that the point binomial distribution
approaches the normal distribution as $n$ gets extremely large.
It is also apparent that $\alpha_3$ is zero whenever $q = p$, and therefore
in such cases we get symmetrical distributions.
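
The approach to the normal values can be watched numerically. The sketch below assumes the four formulas just given; the function name `binomial_summary` and the choice of p = 0.3 with n = 8, 80, 800, and 8000 are ours, made only for illustration.

```python
import math

def binomial_summary(n, p):
    """Mean, standard deviation, alpha3, and alpha4 of the point binomial (q + p)**n."""
    q = 1.0 - p
    mean = n * p
    sigma = math.sqrt(n * p * q)
    alpha3 = (q - p) / sigma
    alpha4 = 1.0 / (n * p * q) - 6.0 / n + 3.0
    return mean, sigma, alpha3, alpha4

for n in (8, 80, 800, 8000):
    mean, sigma, alpha3, alpha4 = binomial_summary(n, 0.3)
    print(f"n = {n:5d}: mean = {mean:7.1f}  sigma = {sigma:6.2f}  "
          f"alpha3 = {alpha3:.4f}  alpha4 = {alpha4:.4f}")
```

As n grows the third alpha shrinks toward 0 and the fourth moves toward 3, while p = q = 0.5 gives a third alpha of exactly 0 for any n.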

Perhaps it is not quite so evident that point binomial distributions
are entirely fixed in terms of their arithmetic means and
their standard deviations. Yet the student will note that the
formulas for $\alpha_3$ and $\alpha_4$ can be stated in the alternative form:

$$
\begin{aligned}
\alpha_3 &= \frac{2\sigma}{\bar{X}} - \frac{1}{\sigma} \\
\alpha_4 &= \frac{1}{\sigma^2} - \frac{6(\bar{X} - \sigma^2)}{\bar{X}^2} + 3
\end{aligned}
$$

Here it is evident that if we know the values of $\bar{X}$ and $\sigma$ we can
find the values of $\alpha_3$ and $\alpha_4$ immediately. Figure 7.1, page 164,
shows the values obtained when we raise $q + p$ to the 14th power
if both $p$ and $q$ equal $\tfrac{1}{2}$; that is, we have the values of the terms
of $(\tfrac{1}{2} + \tfrac{1}{2})^{14}$. We now see from our formulas that for this
distribution the values are

$$
\begin{aligned}
\bar{X} &= 14(\tfrac{1}{2}) = 7 \\
\sigma &= \sqrt{14(\tfrac{1}{2})(\tfrac{1}{2})} = 1.87 \\
\alpha_3 &= \frac{0.5 - 0.5}{1.87} = 0 \\
\alpha_4 &= \frac{1}{3.5} - \frac{6}{14} + 3 = 2.86
\end{aligned}
$$

¹ For proof of these formulas, as well as proofs of the formulas for the
arithmetic mean and standard deviation, see John F. Kenney, “Mathematics
of Statistics,” Vol. II, pp. 11-15, D. Van Nostrand Company, Inc., New
York, 1939.


Fig. 8.1. Point binomials for $(q + p)^8$ with various values of $p$.

We also note immediately from Fig. 8.1 that when $p$ and $q$ are
not equal, the point binomial will be skewed. Suppose we test
this for the case, say, where $p = 0.3$ and $q = 0.7$. As always, we
have $p + q = 1$. If we raise this binomial to the eighth power,
we get

$$
\begin{aligned}
(q + p)^8 = (0.7 + 0.3)^8 ={} & 0.7^8 + 8(0.7^7)(0.3) + 28(0.7^6)(0.3^2) + 56(0.7^5)(0.3^3) \\
& + 70(0.7^4)(0.3^4) + 56(0.7^3)(0.3^5) + 28(0.7^2)(0.3^6) \\
& + 8(0.7)(0.3^7) + 0.3^8
\end{aligned}
$$

If, now, we evaluate each of these terms, we find the following:

$$
\begin{aligned}
(0.7 + 0.3)^8 ={} & 0.05764801 + 0.19765032 + 0.29647548 \\
& + 0.25412184 + 0.13613670 + 0.04667544 + 0.01000188 \\
& + 0.00122472 + 0.00006561
\end{aligned}
$$
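
The nine terms of this expansion can be generated directly. The sketch below assumes Python 3.8 or later for `math.comb`, which supplies the binomial coefficients 1, 8, 28, 56, 70, and so on; the variable names are ours.

```python
import math

n, p = 8, 0.3
q = 1 - p

# Term k of (q + p)**n is the probability of exactly k successes in n trials.
terms = [math.comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]

for k, term in enumerate(terms):
    print(f"{k} successes: {term:.8f}")
print(f"total: {sum(terms):.8f}")
```

The terms agree with the eight-decimal figures above and sum to 1, as they must since p + q = 1.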


If we plot these terms, as in the lower left-hand section of Fig. 8.1,
we obtain an asymmetrical or skewed distribution, as contrasted
with the symmetrical distribution of Fig. 7.1 or the symmetrical
distribution in the center of Fig. 8.1. In Fig. 8.1 we see the point
binomials obtained by raising $(q + p)$ to the eighth power, with
varying values of $p$ and $q$. When $p$ and $q$ are both equal to $\tfrac{1}{2}$
we get the symmetrical point binomial in the center of the figure,
and the more $p$ (or $q$) differs from $\tfrac{1}{2}$ the more skewness becomes
evident.

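In the particular case which we have just worked out, where
$p = 0.3$ and $q = 0.7$, we may try applying our simple formulas
for probability distributions. We find that

$$
\begin{aligned}
\bar{X} &= np = 8(0.3) = 2.4 \\
\sigma &= \sqrt{npq} = \sqrt{(8)(0.3)(0.7)} = \sqrt{1.68} = 1.30 \\
\alpha_3 &= \frac{q - p}{\sigma} = \frac{0.7 - 0.3}{1.30} = 0.308 \\
\alpha_4 &= \frac{1}{npq} - \frac{6}{n} + 3 = \frac{1}{1.68} - \frac{6}{8} + 3 = 2.845
\end{aligned}
$$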
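
As a cross-check, the same four constants can be found the long way, by taking moments of the nine terms of the expansion, and compared with the shortcut formulas. The sketch assumes Python 3.8 or later for `math.comb`; the helper `central_moment` is ours. Because it keeps the standard deviation unrounded, its third alpha may differ in the third decimal place from a figure computed by hand with 1.30.

```python
import math

n, p = 8, 0.3
q = 1 - p
probs = [math.comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]

# Long method: moments taken directly from the probability distribution.
mean = sum(k * pr for k, pr in enumerate(probs))

def central_moment(r):
    """r-th moment of the distribution about its mean."""
    return sum((k - mean) ** r * pr for k, pr in enumerate(probs))

sigma = math.sqrt(central_moment(2))
alpha3_direct = central_moment(3) / sigma**3
alpha4_direct = central_moment(4) / sigma**4

# Short method: the formulas of this section.
alpha3_formula = (q - p) / math.sqrt(n * p * q)
alpha4_formula = 1 / (n * p * q) - 6 / n + 3

print(f"mean = {mean:.2f}, sigma = {sigma:.2f}")
print(f"alpha3: direct {alpha3_direct:.4f}, formula {alpha3_formula:.4f}")
print(f"alpha4: direct {alpha4_direct:.4f}, formula {alpha4_formula:.4f}")
```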


It is the fact that p and q are unequal which has been responsible
for the skewness in eight of the nine sections of Fig. 8.1. Let
us note in the above formulas the effect of changing p or q when
we hold n constant. Let us start with p = q = 1/2 and then
increase the size of p slowly, reducing the size of q always so that
p + q = 1. We note that $\bar{X}$ will increase. But if we increase p
and decrease q, keeping their sum equal to 1, their product pq will
diminish, as the student will discover immediately by experiment.
Therefore $\sigma$ will diminish as p moves away from 0.5 in either
direction. Since $\alpha_3$ is based on q - p, it will be negative when q
is smaller than p and positive when q is larger than p; and the
farther p or q is from 0.5 the greater will be the difference between
them; so the greater will be the absolute size of $\alpha_3$. Finally, as
p or q gets farther from 0.5, the value of pq in the denominator of
the last formula will diminish, thus increasing the value of the
fraction and of $\alpha_4$. To summarize, we note that $\bar{X}$ grows larger
whenever p increases (if we hold n constant). We note also that,
if n does not change, the sizes of the other three values depend on
the amount of difference between the values of p and q. The
greater the difference between p and q, the smaller the value of $\sigma$,
the larger the absolute value of $\alpha_3$, and the larger the value of $\alpha_4$.
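The experiment suggested above can be carried out in a few lines. The following Python sketch is an illustration only, with an assumed helper name; it holds n at 8 and moves p away from 0.5, printing the four statistics for each value.

```python
from math import sqrt

def binomial_shape(n, p):
    """Return mean, sigma, alpha_3 and alpha_4 for the point binomial (q + p)^n."""
    q = 1.0 - p
    npq = n * p * q
    return n * p, sqrt(npq), (q - p) / sqrt(npq), 1.0 / npq - 6.0 / n + 3.0

# Hold n constant at 8 and let p grow from 0.5 toward 1.
rows = [(p,) + binomial_shape(8, p) for p in (0.5, 0.6, 0.7, 0.8, 0.9)]
for p, mean, sigma, alpha3, alpha4 in rows:
    print(f"p={p:.1f}  mean={mean:.2f}  sigma={sigma:.3f}  "
          f"alpha3={alpha3:+.3f}  alpha4={alpha4:.3f}")
```

Since p exceeds q throughout this run, every value of alpha_3 after the first is negative; its absolute size is what grows.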

It is now time for us to learn more definitely how to interpret 
these results, as we do in the next two sections. 

8.6. Measures of Skewness. —Thus far we have confined our 
description of frequency curves in the main to measures of central 
tendency (averages) and measures of dispersion. These two 
types of measures tell us a good deal about the character of the 
distribution. For example, when we have discovered that the
average height of a group of students is 175.3 cm. and the standard
deviation of heights in the group is 6.6 cm., we know (if the
distribution is roughly normal) that about two-thirds of the students
have heights between 168.7 and 181.9 cm. We know also that
almost never should we find a student shorter than 155.5 cm. or
taller than 195.1 cm. (the mean plus and minus $3\sigma$).

We have discovered, however, that the normal curve is sym¬ 
metrical. We could be somewhat more confident in our inter¬ 
pretation of the mean and the measures of dispersion if we 
knew that our distribution was symmetrical. We have seen 
that the distribution of incomes in the United States is not 
symmetrical (page 113), and with any distribution we may well 
wish to test the symmetry. When a distribution is asymmetrical 
we usually call it a skewed frequency distribution, and the meas¬ 
ures of asymmetry are usually called measures of skewness. 

Many measures of skewness have been proposed, and none has 
been uniformly adopted. For this reason, when one gives a 
measure of skewness it is necessary to indicate the method by 
which it was computed. The commoner methods are given 
here. 

If our distribution is mound-shaped (that is, if it has small 
frequencies at the extremities and larger frequencies toward the 
center) and symmetrical, the mean, the median, and the mode 
will coincide. If the curve is skewed, these measures will not
coincide. Thus it is possible to acquire some idea of the absolute
amount of the skewness by noting the amount of the divergence
between any two of these measures of central tendency. If we
wish to measure relative skewness, we shall have to compare the
displacement of the averages with some standard measure of
dispersion. Usually we should use the standard deviation for the
latter measure. Karl Pearson has suggested the following as a
measure of relative skewness:

$$\text{Sk.} = \frac{\bar{X} - \text{Mo.}}{\sigma}$$

We have discovered the following values for these constants in
the case of the heights of Harvard students:

$\bar{X}$ = 175.3 (page 88)
Mo. = 175.2 (page 99)
$\sigma$ = 6.6 (page 143)

Substituting these in the equation for skewness, we have

$$\text{Sk.} = \frac{175.3 - 175.2}{6.6} = \frac{0.1}{6.6} = +0.015$$

[Fig. 8.2. The upper curve exhibits positive skewness, and the lower
curve negative skewness.]

It is evident from the formula that the skewness may be either
positive or negative. It will be positive when the mean exceeds
the mode and negative when the mean is smaller than the
mode. Such cases are illustrated in Fig. 8.2. 1 The upper part of
the figure shows a distribution in which the mean is pulled toward 


1 The upper part of Fig. 8.2 shows the distribution of hourly earnings of 
2960 employees of filling stations in the United States in 1931. Data from 
United States Bureau of Labor Statistics Bulletin 578, p. 9. 

The lower part of Fig. 8.2 shows the distribution of the annual egg
productions of 3131 white Leghorn hens. Data from Storrs Agricultural
Experiment Station Bulletin 147, p. 246.
the right by the few extremely high cases: this is positive skewness.
In the lower part of the figure the mean is pulled toward the left 
by the few very small cases: this is negative skewness. 

We have discovered earlier (page 08) that the mode is difficult 
to find, and that different methods of locating it give different 
results. For that reason Pearson's formula, given above, is not 
entirely satisfactory. We have seen also that, when the asymmetry
is not great, the averages have the following relationship: 1

$$\text{Mo.} = 3\,\text{Med.} - 2\bar{X}$$

If this value of Mo. is substituted in our equation for skewness, it
becomes

$$\text{Sk.} = \frac{\bar{X} - 3\,\text{Med.} + 2\bar{X}}{\sigma} = \frac{3(\bar{X} - \text{Med.})}{\sigma}$$

If we substitute the values of the Harvard student problem in this
equation, remembering that the median was found to be 175.3
(page 95), we have

$$\text{Sk.} = \frac{3(175.3 - 175.3)}{6.6} = 0$$

If we take the figures for the mean and the median as originally
computed before rounding off (see pages 88 and 95), we have

$$\text{Sk.} = \frac{3(175.335 - 175.28)}{6.6} = 0.025$$

Again we find that the skewness is very small in this distribution. 
And again the skewness, what there is of it, is positive. 
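The two Pearsonian measures just described are easily put into a short program. The sketch below is in Python and is offered only as an illustration; the function names are assumptions, and the figures are those quoted above for the Harvard students.

```python
def pearson_skewness_mode(mean, mode, sigma):
    """Pearson's measure: (mean - mode) / sigma."""
    return (mean - mode) / sigma

def pearson_skewness_median(mean, median, sigma):
    """Pearson's alternative measure: 3 (mean - median) / sigma."""
    return 3.0 * (mean - median) / sigma

# Rounded figures, using the mode.
sk_mode = pearson_skewness_mode(175.3, 175.2, 6.6)
# Unrounded mean and median.
sk_median = pearson_skewness_median(175.335, 175.28, 6.6)
print(round(sk_mode, 3), round(sk_median, 3))
```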

In a symmetrical distribution the quartiles would, of course, be
equidistant from the median; that is, Med. - $Q_1$ = $Q_3$ - Med.,
if the distribution is symmetrical. If the distribution is not
symmetrical, the quartiles will not be equidistant (unless the entire
asymmetry is located in the extreme quarters of the data, or
unless there is some very peculiar arrangement of the data within
the central quartiles).

These facts have led Bowley to suggest the following as a measure
of skewness:

$$\text{Sk.} = \frac{(Q_3 - \text{Med.}) - (\text{Med.} - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2\,\text{Med.}}{Q_3 - Q_1}$$

1 See p. 99.


If we apply this measure of skewness to our data, we find 


$Q_3$ = 179.84 (page 129)
$Q_1$ = 170.95 (page 129)
Med. = 175.28 (page 95)

$$\text{Sk.} = \frac{179.84 + 170.95 - 350.56}{179.84 - 170.95} = \frac{0.23}{8.89} = 0.026$$


This measure is always equal to zero when the quartiles are equi¬ 
distant from the median and is positive when the upper quartile 
is farther from the median than the lower quartile. In general 
this measure of skewness and Pearson's measure should have the 
same sign. Here the skewness is so slight that a change in sign 
might well occur, since there is practically no skewness and the 
measure may fall either side of zero almost by chance. Bowley's 
measure of skewness cannot exceed 1 in absolute size—that is, it 
varies between +1 and — 1. 
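Bowley's measure may be sketched in the same manner. The function below is an illustration only (its name is an assumption); it is applied to the quartiles and median of the student heights, and then to an extreme case in which the median coincides with the lower quartile, where the measure reaches its upper limit of +1.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's quartile measure: (Q3 + Q1 - 2 Med.) / (Q3 - Q1)."""
    return (q3 + q1 - 2.0 * median) / (q3 - q1)

# Student heights: Q1 = 170.95, Med. = 175.28, Q3 = 179.84.
sk_heights = bowley_skewness(170.95, 175.28, 179.84)
# Extreme case: the median coincides with the lower quartile.
sk_extreme = bowley_skewness(0.0, 0.0, 1.0)
print(round(sk_heights, 3), sk_extreme)
```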

Bowley's measure neglects the two extreme quarters of the 
data. It would be better for a measure to cover a larger part of 
the data, especially since in measuring skewness we are often 
especially interested in the more extreme items. Bowley's measure
can be extended by taking any two deciles equidistant from
the median or any two percentiles equidistant from the median.
For example, in a symmetrical distribution the 2d decile and the
8th decile are equidistant from the median. We could determine
whether or not they were equally distant from the median in any 
given distribution and base a measure of skewness on them. 
Likewise we could base a measure of skewness on the distances of 
the 4th and 96th percentiles from the median. Kelley has shown 1
that the percentiles having the least error in a normal distribution
are the 6.917th and 93.083d percentiles (or, rather, that the range
between these percentiles has less error than the range between
any other two percentiles). For this reason Kelley suggests a
measure of dispersion based on the 10th and the 90th percentiles
(the 1st and 9th deciles) as being close to this point of greatest
accuracy. He suggests as a measure of skewness the following:

$$\text{Sk.} = \frac{P_{90} + P_{10}}{2} - P_{50}$$

1 Kelley, "Statistical Method," p. 75, The Macmillan Company, New
York, 1924.

This measure of skewness has been but little used, but has some 
theoretical attractions if skewness is to be based on percentiles (or 
deciles or quartiles) at all. 

One of the most useful measures of skewness, however, is a
measure which we have already obtained. This is $\alpha_3$. It gives
a value of zero for the normal curve, as we have seen. It is
positive if the mean is larger than the mode, and negative if the mean
is smaller than the mode. It thus agrees in sign with the measures
we have previously considered. 1 Its interpretation is made
somewhat simpler if we understand that the following relationship
is approximately correct in slightly skewed distributions: 2

$$\frac{\bar{X} - \text{Mo.}}{\sigma} = \frac{\alpha_3}{2}$$

or

$$\alpha_3 = \frac{2(\bar{X} - \text{Mo.})}{\sigma}$$

Then $\alpha_3$ is equal to twice the distance from the mean to the mode,
expressed in units of the standard deviation. We have discovered
in our illustrative case that $\alpha_3 = -0.035$. This means, we now
discover, that the mean is smaller than the mode, and that they
are separated by an amount equal to $\tfrac{1}{2}(0.035)\sigma$, or $0.017\sigma$. The
student may be interested to compare this measure of skewness
with the values of $\alpha_3$ obtained from the skewed distributions
depicted in Fig. 8.2, page 201. For the upper distribution we
find $\alpha_3 = +0.307$; in the lower distribution $\alpha_3 = -0.518$.

1 This measure is useful also in that we can use it with tables of the
areas under skewed curves. Just as we used tables of areas under the normal
curve and ordinates of the normal curve on pp. 177ff., so we can use tables
for skewed curves if we know the value of $\alpha_3$. Tables for use in this
connection have been published by Salvosa in Annals of Mathematical Statistics,
Vol. I, pp. 191ff.

2 Camp, "Elementary Statistics," p. 47, D. C. Heath and Company,
Boston, 1931.
It should also be noted that this relationship gives us a new
method for computing the value of the mode. In fact this
method, although more tedious than those suggested in Chap. IV
because it involves the computation of the higher moments, gives
better results than any of the methods heretofore described. If
we compute the modal height of Harvard students by this
formula, we have

$\alpha_3$ = -0.035 (page 196)
$\bar{X}$ = 175.335 (page 88)
$\sigma$ = 6.582 (page 143)

$$\frac{175.335 - \text{Mo.}}{6.582} = \frac{-0.035}{2}$$

$$\text{Mo.} = 175.45$$


Better yet as an estimate of the mode, but requiring still more
computation, is the following: 1

$$\frac{\bar{X} - \text{Mo.}}{\sigma} = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

where $\beta_1 = \alpha_3^2$ and $\beta_2 = \alpha_4$.

This, again, can be used as a measure of skewness, being positive
if the mean exceeds, and negative if the mean falls short of, the
mode. Computing the value of this measure of skewness from
our illustrative data, we have

$\bar{X}$ = 175.335 (page 88)
$\sigma$ = 6.582 (page 143)
$\beta_1$ = 0.00117 (page 196)
$\beta_2$ = 2.931 (page 196)

$$\text{Sk.} = \frac{\sqrt{0.00117}\,(2.935 + 3)}{2(14.675 - 0.00702 - 9)} = \frac{0.203}{11.34} = 0.0179$$

This shows an extremely slight positive skewness. If we use
this value for computing the mode, we have

$$\frac{175.335 - \text{Mo.}}{6.582} = 0.0179$$

$$\text{Mo.} = 175.217$$


1 Mills, “Statistical Method,” p. 546, Henry Holt and Company Inc., 
New York, 1924. 
This value for the mode is almost identical with the value dis¬
covered by the other method based on moments. It is probably 
the best estimate we can make of the modal height of the 1000 
Harvard students. 
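Both moment-based estimates of the mode can be reproduced with a short program. The following Python sketch is illustrative only; the function names are assumptions. It uses the figures that appear in the worked computations above (including the value 2.935 for the second constant, as used in the arithmetic), and it takes the square root as positive, as the worked example does.

```python
from math import sqrt

def mode_from_alpha3(mean, sigma, alpha3):
    """Mode from the approximate relation (mean - mode) / sigma = alpha_3 / 2."""
    return mean - sigma * alpha3 / 2.0

def pearson_beta_skewness(beta1, beta2):
    """Skewness sqrt(beta1) (beta2 + 3) / [2 (5 beta2 - 6 beta1 - 9)], root taken positive."""
    return sqrt(beta1) * (beta2 + 3.0) / (2.0 * (5.0 * beta2 - 6.0 * beta1 - 9.0))

def mode_from_skewness(mean, sigma, skewness):
    """Mode from (mean - mode) / sigma = skewness."""
    return mean - sigma * skewness

mode_a = mode_from_alpha3(175.335, 6.582, -0.035)
sk = pearson_beta_skewness(0.00117, 2.935)
mode_b = mode_from_skewness(175.335, 6.582, sk)
print(round(mode_a, 2), round(sk, 4), round(mode_b, 3))
```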

The student will now understand why he was told on page 200 
that it is always necessary to accompany any measure of skewness 
with a statement of the method of computation. We have com¬ 
puted the skewness of the student heights by several methods, 
and have found varying answers. Let us collect the answers 
for purposes of comparison: 

Sk. = +0.015 (page 201)
Sk. = +0.025 (page 202)
Sk. = +0.026 (page 203)
Sk. = -0.035 (page 204)
Sk. = +0.0179 (page 205)

It will be noted that, although there is some variation among 
these measures, as is to be expected since they have been com¬ 
puted by decidedly different methods, nevertheless the five 
results are nearly identical in size. The fact that the measures 
differ in sign, one being negative and the other four positive, is 
unimportant, since they are all approximately equal to zero. 
The extreme difference, between the third and the fourth meas¬ 
ures, amounts to but 0.061; that is, the differences are confined 
to the second decimal place. 

8.7. Measures of Kurtosis.—We have studied measures of 
central tendency, measures of dispersion, and, just now, measures 
of skewness. There remains but one more common type of measure
of the characteristics of a frequency distribution. These
measures are called measures of kurtosis. They show the extent
to which the distribution is more peaked or more flat-topped than
the normal curve. 1 If the items are more closely bunched around
the mode than normal, making the curve unusually peaked, we
say that the curve is leptokurtic. If, on the other hand, the curve
is more flat-topped than normal, we say that it is platykurtic.
The normal curve itself is mesokurtic. The condition of peakedness
or of flat-toppedness itself is known as kurtosis or excess.

1 While this is usually true it seems impossible to rely on it without excep¬ 
tion. The student will have to consider interpretations of measures of 
kurtosis as approximate. 
The principal measure of kurtosis is the value which we have
already computed and called $\alpha_4$. It is also sometimes symbolized
by $\beta_2$, the two being identical, and, as we have seen, being defined
thus:

$$\alpha_4 = \beta_2 = \frac{\sum f x^4}{N\sigma^4}$$

where x denotes the deviation of a class mark from the arithmetic mean.



In the normal curve, $\alpha_4$ and $\beta_2$ equal 3. When they are greater
than 3, the curve is more peaked than the normal curve, and is
said to be leptokurtic. When they are less than 3, the curve has a
flatter top than the normal curve, and is said to be platykurtic.
The normal curve, and other curves with $\alpha_4$ and $\beta_2$ equal to 3,
are said to be mesokurtic. In the distribution of student heights,
$\alpha_4$ is 2.93. The curve is slightly flatter than the normal curve;
it is slightly platykurtic.

A very peaked curve has kurtosis greater than 3. As we
flatten the curve the value of $\alpha_4$ decreases, and when it has
reached 3 the curve is mesokurtic. If we flatten it still more the
curve becomes platykurtic. Ultimately, of course, the curve
will flatten out entirely into a straight line, with the various
frequency classes containing equal numbers of cases. This is
what we have called a rectangular distribution (see Sec. 3.14, page
43). The value of $\alpha_4$ for a rectangular distribution depends on
the number of frequency classes, being 1.8 when the number of
classes is infinite, and approaching 1.8 rather rapidly for finite
numbers of classes. Table 8.3 shows the values of $\alpha_4$ for
rectangular distributions with small numbers of classes.

If we continue to push down the middle of the distribution
still further, the value of $\alpha_4$ will fall below that given in Table
8.3, and the distribution will become U-shaped. We can, then,
judge something of the shape of the frequency curve by the value
of $\alpha_4$. If $\alpha_4$ is greater than 3, the curve is more peaked than the
normal curve. If the value of $\alpha_4$ lies between 3 and the value
given in Table 8.3, the curve is flatter than the normal curve, but
still mound-shaped. If $\alpha_4$ has the value given in Table 8.3, the
distribution is rectangular. If $\alpha_4$ has a value lower than that
given in Table 8.3, the distribution is U-shaped. If the student
does not have a copy of the table handy for reference, he can
remember that in most actual frequency tables the critical value
of $\alpha_4$ which separates mound-shaped from U-shaped curves is
between 1.75 and 1.8.





Table 8.3. Values of $\alpha_4$ in Rectangular Distributions with Various
Numbers of Classes

Number of    Value of       Number of    Value of
Classes      $\alpha_4$          Classes      $\alpha_4$
    1        0.0000            12        1.7832
    2        1.0000            13        1.7857
    3        1.5000            14        1.7877
    4        1.6400            15        1.7893
    5        1.7000            16        1.79059
    6        1.7314            17        1.79167
    7        1.7500            18        1.79257
    8        1.7619            19        1.79333
    9        1.7700            20        1.793985
   10        1.7758            25        1.796153
   11        1.7800            30        1.797330
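The entries of Table 8.3 can be reproduced by computing the moments of a distribution having the same frequency in each of k equally spaced classes. The Python sketch below is an illustration with assumed function names. The closed form in the second function is a standard result for such distributions and is not stated in the text; it is included as an assumption for comparison, and it makes plain why the values approach 1.8 as the number of classes grows.

```python
def rectangular_alpha4(k):
    """alpha_4 for k equally frequent, equally spaced classes (k >= 2), from the moments."""
    marks = range(k)
    mean = sum(marks) / k
    m2 = sum((x - mean) ** 2 for x in marks) / k
    m4 = sum((x - mean) ** 4 for x in marks) / k
    return m4 / m2**2

def rectangular_alpha4_closed(k):
    """Closed form 3 - (6/5)(k^2 + 1)/(k^2 - 1); a standard result, not taken from the text."""
    return 3.0 - 1.2 * (k * k + 1) / (k * k - 1)

# With a single class the variance is zero and alpha_4 is undefined, so start at k = 2.
for k in (2, 3, 4, 5, 7, 9, 12, 20, 30):
    print(k, round(rectangular_alpha4(k), 6), round(rectangular_alpha4_closed(k), 6))
```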


8.8. Interpretation of Frequency Statistics.—In general we can 
describe any frequency distribution quite satisfactorily in terms 
of five statistical measures: 

1. The number of cases. 

2. An average, or measure of central tendency. 

3. A measure of dispersion. 

4. A measure of skewness. 

5. A measure of kurtosis. 


By far the commonest among the last four of these measures
are the arithmetic mean, the standard deviation, $\alpha_3$, and $\alpha_4$.
Any given distribution has some definite five values of these
measures, yet in some other distribution all five measures may be
different. Such a number, used to describe a frequency distribution,
being a constant for any particular distribution but a variable
as we shift from distribution to distribution, we call a statistic
of the distribution. We can say, then, that a frequency
distribution can be described with reasonable accuracy in terms of five
statistics.

The student who remembers our use of the normal curve to
describe distributions (see Secs. 7.8 and 7.9 and Fig. 7.7) may
feel that the first three of these statistics are enough: that we
can get a complete and satisfactory description of a distribution
in terms of its frequency, arithmetic mean, and standard deviation.
Let us look, therefore, at Table 8.4. In this table the
first column represents class limits, and each of the following
four columns represents a set of frequencies. We have really
combined in Table 8.4 four frequency tables for easy comparison.
To distinguish them, we have labeled the frequencies of the first
distribution $f_1$, those of the second distribution $f_2$, those of the
third $f_3$, and those of the fourth $f_4$. The table tells us, for example,
that there are 16 cases in the 70-79 class in the first distribution,
while there are 17 cases in the same class in the second
distribution, 12 in the third, and 6 in the fourth.

Table 8.4. Four Frequency Distributions Illustrating the Use of
Common Frequency Statistics

Class Limits     f1    f2    f3    f4
  20- 29          1
  30- 39          4     2
  40- 49          6     5    12
  50- 59          8    10    12    34
  60- 69         10    16    12    12
  70- 79         16    17    12     6
  80- 89         18    18    12     4
  90- 99         16    12    12     6
 100-109         10    10    12    12
 110-119          8     7    12    34
 120-129          6     5    12
 130-139          4     3
 140-149          1     1
 150-159                1
 160-169                1

If the student will take the trouble to compute the arithmetic
means and standard deviations of these distributions, he will find
that they are exactly the same. In each case the arithmetic
mean is 85 and the standard deviation is 25.8. Moreover, in
each distribution the total number of cases is 108. On the basis
of these three statistics alone, we should be tempted to say that
the four distributions were identical. Yet these distributions
are shown graphically in Fig. 8.3. The student will see immediately
that there is little similarity between the four cases in spite
of the fact that they have exactly the same numbers of cases,
exactly the same arithmetic mean and exactly the same standard
deviation.

If, now, we compute the values of $\alpha_3$ and $\alpha_4$ we uncover the
differences at once. The entire five frequency statistics for the
four cases are



              Case 1    Case 2    Case 3    Case 4
N              108       108       108       108
$\bar{X}$             85        85        85        85
$\sigma$             25.8      25.8      25.8      25.8
$\alpha_3$              0       +0.57       0         0
$\alpha_4$            2.565     3.188     1.770     1.23


The values of $\alpha_3$ and $\alpha_4$ immediately serve to distinguish the
distributions. We note that all distributions save number 2
are symmetrical, while that one has moderate positive skewness.
Case 2 is more peaked than the normal curve (leptokurtic); case
1 is mound-shaped but flatter than the normal curve; case 3
(since there are 9 classes in that distribution) is exactly
rectangular, as we see when we compare its $\alpha_4$ value with that given in
Table 8.3; and case 4 is U-shaped (since it was computed from
data distributed in seven classes, and any value less than 1.75
signifies that such a distribution is U-shaped). A comparison
of the values of the statistics of these four distributions with
their histograms in Fig. 8.3 will help the student to understand
the interpretation. In case 2 the skewness is so marked that the
arithmetic mean is thrown about 0.28 standard deviations to
the right of the mode. The student will find it informative to
compute the frequency statistics from the data in Table 8.4 as a
check.
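The check suggested above may be sketched in Python as follows. This is an illustration only. It assumes that the class marks are 25, 35, ..., 165 (the marks that yield the stated mean of 85), that blank cells of Table 8.4 stand for zero frequencies, and that the moments are computed without any correction for grouping; the names used are likewise assumptions.

```python
from math import sqrt

# Assumed class marks for the classes 20-29, 30-39, ..., 160-169.
CLASS_MARKS = [25 + 10 * i for i in range(15)]

# Frequencies from Table 8.4, with blank cells entered as zeros.
FREQUENCIES = {
    "Case 1": [1, 4, 6, 8, 10, 16, 18, 16, 10, 8, 6, 4, 1, 0, 0],
    "Case 2": [0, 2, 5, 10, 16, 17, 18, 12, 10, 7, 5, 3, 1, 1, 1],
    "Case 3": [0, 0, 12, 12, 12, 12, 12, 12, 12, 12, 12, 0, 0, 0, 0],
    "Case 4": [0, 0, 0, 34, 12, 6, 4, 6, 12, 34, 0, 0, 0, 0, 0],
}

def frequency_statistics(marks, freqs):
    """Return N, mean, sigma, alpha_3 and alpha_4 of a grouped frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n

    def central_moment(r):
        return sum(f * (x - mean) ** r for x, f in zip(marks, freqs)) / n

    m2 = central_moment(2)
    sigma = sqrt(m2)
    return n, mean, sigma, central_moment(3) / sigma**3, central_moment(4) / m2**2

summary = {name: frequency_statistics(CLASS_MARKS, f) for name, f in FREQUENCIES.items()}
for name, (n, mean, sigma, a3, a4) in summary.items():
    print(f"{name}: N={n}  mean={mean:.1f}  sigma={sigma:.1f}  "
          f"alpha3={a3:+.2f}  alpha4={a4:.3f}")
```

Under these assumptions the first three statistics agree for all four cases, and only the last two separate them.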

In summarizing the description of any frequency distribution, 
then, we give five statistics. If we wish to summarize our 
description of the heights of Harvard students, we bring together 
the various statistics which we have computed as follows: 

N = 1000
$\bar{X}$ = 175.335 (page 88)
$\sigma$ = 6.582 (page 143)
$\alpha_3$ = -0.035 (page 196)
$\alpha_4$ = 2.93 (page 196)


The distribution is perhaps very slightly skewed in a negative
direction, although it is practically symmetrical. It also is
slightly flatter than the normal curve. The symmetrical
distribution of Fig. 7.1 has the values $\alpha_3 = 0$ and $\alpha_4 = 2.86$. It is
symmetrical and slightly platykurtic. The distribution in Fig. 8.1
has the values $\alpha_3 = 0.357$ and $\alpha_4 = 2.794$. It has positive
skewness, with the arithmetic mean larger than the mode, and
lying about 0.178 standard deviations to the right of the mode.
It is also slightly flat-topped, but definitely mound-shaped rather
than rectangular or U-shaped.

[Fig. 8.3. These four distributions all have exactly the same number
of cases, the same arithmetic mean, and the same standard deviation.]

8.9. The Pearsonian System of Frequency Curves.—When $\alpha_3$ differs
very markedly from 0, or when $\alpha_4$ differs very markedly from 3, we
know that our curve is not normal. Then we have to turn to some
other kind of frequency curve to describe our data. Many such
non-normal curves have been described, but among the most useful
are the families of curves described by Karl Pearson, the eminent
English biometrician. Pearson's system includes frequency curves
which are mound-shaped but asymmetrical, those which are J-shaped,
those which are U-shaped, etc. A very large proportion of all actual
frequency distributions can be described tolerably well by one or
another of the 12 different classes of curves which Pearson describes.
When we are fitting the normal curve to a distribution, we need
know only the number of cases, the arithmetic mean, and the
standard deviation (see Secs. 7.8 and 7.9). For most of Pearson's
curves it is necessary, in addition, to know the values of $\alpha_3$ and
$\alpha_4$, or, to use Pearson's symbolism, to know the values of $\beta_1$ and
$\beta_2$, where

$$\beta_1 = \alpha_3^2$$

$$\beta_2 = \alpha_4$$

In deciding what type of curve to fit, it is also necessary to know
the value of $\kappa_2$, which is defined as follows:

$$\kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}$$

Although there are a dozen classes of curves in the Pearson 
system, his Type III curve (of which the normal curve is a special 
case) is by far the most important. Fortunately it is also the 
easiest to fit. We shall confine our discussion of the Pearson 
curves to the Type III curve, and the student who wishes to 
investigate the other types is referred to the books suggested at 
the end of this chapter (see Sec. 8.13). 

8.10. Fitting Pearson's Type III Curve.—Pearson's Type III
curve may be fitted to any distribution in which

$$\frac{2\alpha_4 - 3\alpha_3^2 - 6}{\alpha_4 + 3} = 0$$

or, if we use Pearson's symbols,

$$\frac{2\beta_2 - 3\beta_1 - 6}{\beta_2 + 3} = 0$$

In the normal curve this expression will be equal to zero, and,
in addition, $\alpha_3$ will be equal to zero. The Type III curve covers
all cases where the former expression is equal to zero, whether or
not $\alpha_3$ equals zero: hence we see that the Type III curve will
cover asymmetrical as well as symmetrical curves.

The fitting of Type III curves is exactly parallel to the fitting 
of the normal curve which was explained and illustrated in Secs. 
7.8 and 7.9. We can fit Type III curves either by areas or by 
ordinates, and we make use of tables of ordinates and areas of 
skewed curves which are somewhat similar to the tables of areas 
and ordinates of the normal curve in Appendixes I and II, except 
that we now have to have separate entries for each different 
degree of skewness. Detailed tables showing the areas and ordinates
have been published by Salvosa, 1 and condensed extracts
1 Luis K. Salvosa, Tables of Pearson's Type III Function, Annals of
Mathematical Statistics, Vol. 1, May, 1930, pp. 191 ff.
from these tables appear in the laboratory manual which accompanies
this text. 1

We illustrate the fitting of the skewed curve in Table 8.5. 
The data in the first column are class marks, while those in the 
second column are frequencies. If we go through the processes 
explained earlier in this chapter, we find the following values for 
this table: 

N = 675
$\bar{X}$ = 124.6
$\sigma$ = 15.49
$\alpha_3$ = 0.408
$\alpha_4$ = 3.258

When we test the distribution to see whether or not a Type III
curve should be fitted, we discover

$$\frac{2\alpha_4 - 3\alpha_3^2 - 6}{\alpha_4 + 3} = \frac{2(3.258) - 3(0.408^2) - 6}{3.258 + 3} \approx 0.0027$$

This expression gives a value of zero in Type III curves, and 
here the value is so nearly equal to zero that it seems in advance 
fairly safe to try that type of curve. In fact, practice will show 
that this curve can well be fitted even when there is a considerable 
departure from zero. 
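The test for the suitability of a Type III curve is a single line of arithmetic, sketched here in Python. The function name is an assumption made for illustration; the expression is the one given at the opening of this section.

```python
def type3_criterion(alpha3, alpha4):
    """Value of (2 alpha_4 - 3 alpha_3^2 - 6) / (alpha_4 + 3); zero for a Type III curve."""
    return (2.0 * alpha4 - 3.0 * alpha3**2 - 6.0) / (alpha4 + 3.0)

# The distribution of Table 8.5, and the normal curve for comparison.
criterion_table_8_5 = type3_criterion(0.408, 3.258)
criterion_normal = type3_criterion(0.0, 3.0)
print(round(criterion_table_8_5, 4), criterion_normal)
```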

Column 3 of Table 8.5 shows the distances of the class marks 
from the arithmetic mean, found by subtracting the value of the 
arithmetic mean (124.6) from each class mark. Column 4 is 
found by dividing each entry in column 3 by the standard devia¬ 
tion (15.49) to convert the distances into standard units. Col¬ 
umn 5 is found from the printed tables, similar to those in Appen¬ 
dix II. The entries here are from Salvosa’s detailed tables of 
ordinates of skewed curves, although approximately the same 
values could be found by interpolation in the tables of the author’s 
manual. The student is warned that these entries in column 5 
cannot be computed from or found in any material in this volume, 
but the process of finding them is parallel to the process of finding 
the figures in column 5 of Table 7.5, page 180, except that other 

1 Albert E. Waugh, “Laboratory Manual and Problems for Elements 
of Statistical Method,” 2d ed., Tables A3 and A4, McGraw-Hill Book Com¬ 
pany, Inc., New York, 1944. 
tables are used. In the last column, we have the computed
frequencies. These are found, as in Sec. 7.8, by multiplying the
tabular values of column 5 by a constant equal to $N(C_i)/\sigma$,
which, in this problem, is 675(10)/15.49, or 436. In other words,
each item in the last column is found by multiplying the
corresponding item in column 5 by 436, and in any other problem the
last column would always be found by multiplying the tabular
values by a constant equal to $N(C_i)/\sigma$.


Table 8.5. Fitting a Skewed Type III Curve by the Method of
Ordinates

Class     Number                            Tabular    Computed
Mark      of Cases                          Value      Frequency
(X)       (f)         x        x/$\sigma$        (y)        (f')
184.5        2       59.9      3.87       0.0015       0.65
174.5        2       49.9      3.22       0.0062       2.7
164.5        6       39.9      2.58       0.0216       9.4
154.5       28       29.9      1.93       0.0647      28.2
144.5       76       19.9      1.28       0.1569      68.4
134.5      126        9.9      0.64       0.2918     127.2
124.5      169       -0.1     -0.006      0.3980     173.5
114.5      159      -10.1     -0.65       0.3626     158.1
104.5       82      -20.1     -1.30       0.1923      83.8
 94.5       24      -30.1     -1.94       0.0494      21.5
 84.5        1      -40.1     -2.59       0.0041       1.8


These computed frequencies in the last column are the frequencies
for the corresponding Type III curve; that is, they are
the frequencies which there would be in each class in a Type III
curve when N was 675, $\bar{X}$ was 124.6, $\sigma$ was 15.49, and $\alpha_3$ was
+0.408. We note, for example, that while there really were 169
cases in the class which centered at 124.5, we should have
expected to find 173.5 cases in this class in a Type III curve.
A comparison of the actual frequencies in column 2 with the
computed frequencies in the last column will show that the actual
distribution was very much like a Type III distribution; although
the student is warned not to rely on such casual inspection to
determine whether or not the correspondence between actual
and computed frequencies is close. Even an experienced
statistician would be unable to tell from superficial inspection how well
the two sets of data corresponded, and we shall see at a later
point in this chapter (Sec. 8.12) how one may test the "goodness
of fit" quantitatively. We have now gone far enough, however,
so that the student should see that it is a relatively easy matter
to fit a Type III curve from the prepared tables, following the
same general system which we have already used for the normal
curve. 1
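The arithmetic of Table 8.5 can be sketched in Python as follows. This is an illustration only: the tabular values of column 5 are copied from the table itself (they come from Salvosa's tables and are not computed here), and the variable names are assumptions. The program reproduces columns 3, 4, and 6 from the class marks, the mean, the standard deviation, and the class interval.

```python
N = 675
MEAN = 124.6
SIGMA = 15.49
CLASS_INTERVAL = 10

CLASS_MARKS = [184.5, 174.5, 164.5, 154.5, 144.5, 134.5,
               124.5, 114.5, 104.5, 94.5, 84.5]
ACTUAL = [2, 2, 6, 28, 76, 126, 169, 159, 82, 24, 1]
# Column 5 of Table 8.5, copied from the table (ordinates for alpha_3 = +0.408).
TABULAR_Y = [0.0015, 0.0062, 0.0216, 0.0647, 0.1569, 0.2918,
             0.3980, 0.3626, 0.1923, 0.0494, 0.0041]

# Constant by which every tabular value is multiplied (about 436).
multiplier = N * CLASS_INTERVAL / SIGMA

rows = []
for mark, actual, y in zip(CLASS_MARKS, ACTUAL, TABULAR_Y):
    x = mark - MEAN              # column 3: deviation from the mean
    t = x / SIGMA                # column 4: deviation in standard units
    computed = multiplier * y    # column 6: computed frequency
    rows.append((mark, actual, x, t, y, computed))

for mark, actual, x, t, y, computed in rows:
    print(f"{mark:6.1f} {actual:4d} {x:7.1f} {t:7.3f} {y:7.4f} {computed:7.1f}")
```

Small differences in the last figure arise because the text multiplies by the rounded constant 436.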

8.11. The Poisson Series.—In addition to the normal curve 
and the Type III curves, there is another whole set of frequency 
curves often used in advanced statistics called the Gram-Charlier 
series. This series we cannot cover here, but we shall take time 
to say a few words about another very useful type of frequency 
curve called the Poisson curve. 2 We have seen that the normal
curve occurs in distributions that are subject to chance in which
the result depends on a large number of causes, each of which
is a 50-50 chance. If the causes are not 50-50 (that is, if
$p \neq q$) our distribution is asymmetrical or skewed, as we
saw in Sec. 8.5. Sometimes this skewness is moderate, when
p and q are almost the same size, but if either p or q becomes very 
small the distribution takes on a marked skewness and the normal 
curve cannot be used to describe it with even approximate 
accuracy. Under such circumstances the Poisson distribution 
is sometimes useful. 

Let us start by looking back at the moments of probability 
distributions given in Sec. 8.5. Here we discover that, in such 
a distribution, we have the following values of the commoner 
statistics (see page 197). 

1 There are at least two respects in which the method of fitting Type III
curves differs from the methods studied earlier for normal curves. In the
first place, since the Type III curves are not symmetrical, we must keep
track of the sign of our deviations from the arithmetic mean, since the
height of the curve at a point 1.4 standard deviations above the mean will
not equal the height at a point 1.4 standard deviations below the mean.
And in the second place, the tables are made up for positive values of $\alpha_3$.
If we fit a Type III distribution to a set of data in which $\alpha_3$ is negative, we
must reverse the table by selecting points in the table which deviate on the
opposite side of the mean from those in our problem, taking, for example,
$-1.7\sigma$ in the table when our class mark is $+1.7\sigma$.

2 Technical discussion of the basic assumptions of the Poisson series can
be found in Lucy Whitaker, On the Poisson Law of Small Numbers,
Biometrika, Vol. 10, 1914-1915, pp. 36ff.; and R. A. Fisher, "Statistical Methods
for Research Workers," Oliver & Boyd, Edinburgh and London, 1932.
$$\bar{X} = np$$

$$\sigma = \sqrt{npq}$$

$$\alpha_3 = \frac{q - p}{\sqrt{npq}}$$

$$\alpha_4 = \frac{1}{npq} - \frac{6}{n} + 3$$


Let us now assume that p has become very, very small, so that 
q is almost equal to 1. Yet let us suppose that we are dealing 
with a large enough number of cases so that np is a sensible 
quantity even though p is very small. Then we see right away 
that $\sqrt{npq}$ will be substantially the same as $\sqrt{np}$ (since q will
be approximately equal to 1). Therefore under such circumstances
the value of the standard deviation will be approximately
$\sqrt{np}$ or $\sqrt{\bar{X}}$. Similarly the numerator of the value of $\alpha_3$ will
become approximately 1 (since q will be approximately 1, and
p will be too small to have much influence); hence the value of $\alpha_3$
will be approximately $1/\sqrt{\bar{X}}$, or $1/\sigma$. Similarly, if n is very
large the value of 6/n will be negligible, and our formula for $\alpha_4$
will become approximately $3 + 1/\bar{X}$. In such a distribution,
then, we can state our statistics in terms of the following adjusted 
formulas: 

$$\bar{X} = np$$

$$\sigma = \sqrt{\bar{X}}$$

$$\alpha_3 = \frac{1}{\sqrt{\bar{X}}} = \frac{1}{\sigma}$$

$$\alpha_4 = 3 + \frac{1}{\bar{X}}$$
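
As a rough numerical check on these adjusted formulas, the following Python sketch sets the exact binomial statistics beside the adjusted ones. The function names and the trial values n = 10,000 and p = 0.0005 are assumptions chosen only for illustration; they do not come from the text.

```python
import math

def binomial_stats(n, p):
    """Exact mean, sigma, alpha_3 and alpha_4 of a binomial distribution."""
    q = 1.0 - p
    mean = n * p
    sigma = math.sqrt(n * p * q)
    alpha3 = (q - p) / sigma
    alpha4 = 1.0 / (n * p * q) - 6.0 / n + 3.0
    return mean, sigma, alpha3, alpha4

def adjusted_stats(mean):
    """Adjusted statistics, which depend on the arithmetic mean alone."""
    sigma = math.sqrt(mean)
    return mean, sigma, 1.0 / sigma, 3.0 + 1.0 / mean

# Assumed illustrative values: a very small p with a large n.
n, p = 10_000, 0.0005
exact = binomial_stats(n, p)
approx = adjusted_stats(n * p)
for name, e, a in zip(("mean", "sigma", "alpha_3", "alpha_4"), exact, approx):
    print(f"{name:8s} binomial = {e:.5f}   adjusted = {a:.5f}")
```

With values of this sort the two sets of figures should agree to about three decimal places, and the agreement improves as p is made smaller while np is held fixed.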


It is at once evident that if the value of the arithmetic mean 
is known we can compute the values of all the other frequency 
statistics. This is one of the peculiarities of the Poisson
distribution: we need know only the arithmetic mean to fit the
entire curve. It is a curve that is useful in describing data
subject to chance, in which the chance of occurrence is very, very
small, but in which there are enough total observations so that
the phenomenon does actually occur sometimes. Suppose, for
example, that we were considering the number of ministers
murdered in Chicago each year, tabulating the number of years
in which no minister was murdered, the number in which one was
murdered, the number in which two were murdered, etc. It is
probable that we should get a J-shaped distribution, with most
of our years in the “no-murder” class, fewer in the “one-murder”
class, and fewer and fewer years as the number of murders
increased. Other hypothetical examples of this sort of distribution
are given in Sec. 3.13, where J-shaped distributions are
discussed. Poisson series are not always J-shaped, but they are
usually either J-shaped or mound-shaped and badly skewed in a
positive direction. In such distributions it will often pay to try
the Poisson curve, especially since it is unusually easy to fit. It
may be wise to test our distribution in a preliminary manner to
see, for example, if the standard deviation is approximately equal
to the square root of the arithmetic mean or if $\alpha_3$ is roughly equal
to the reciprocal of the standard deviation.

We shall illustrate the fitting of this distribution with figures 
showing the rate at which vacancies have occurred in the U.S. 
Supreme Court from 1837 to 1950. 1 During this period the 
numbers of years in which various numbers of vacancies occurred 
were as follows: 

Number of        Number
Vacancies        of Years
    0               68
    1               33
    2               11
    3                2
  Total            114

From this table we compute the average number of vacancies, 
which is 0.535 per year. The standard deviation is 0.740, and the 
square root of the arithmetic mean is 0.731. The two values are 
roughly equal, and this, together with the J-shaped character of 
the distribution, suggests the possibility of fitting a Poisson
distribution. To fit such a distribution we need no information other
than the number of cases and the average. 2 The formula used is 

1 The data are suggested by W. Allen Wallis, The Poisson Distribution and
the Supreme Court, Journal of the American Statistical Association, Vol. 31,
June, 1936, pp. 376ff.

2 The average used in fitting this series must always be computed by 
numbering the classes from 0 up. In this case they were naturally so 
numbered, but if other class limits had appeared, we should have had to 
substitute for them the numbers 0, 1, 2, 3, . . . , and compute the average

$$y = N \, \frac{A^x e^{-A}}{x!}$$

where y is the estimated frequency, N the number of cases, A the 
average occurrence computed as directed in footnote 2, page 217, 
e = 2.71828 approximately, and x! is factorial x or the product of
all positive integers from 1 to x. In our problem N is 114, A is 
0.535, and x takes successively the values 0, 1, 2, and 3. We 
then solve the equation for the corresponding values of y. 
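
The preliminary figures quoted for this table (the average, the standard deviation, and the square root of the average) can be checked with a short Python sketch. The variable and function names are assumptions made for illustration, and the standard deviation is taken with N in the divisor, which is the choice that appears to reproduce the value printed in the text.

```python
import math

# Frequency table from the text: number of vacancies -> number of years.
years_by_vacancies = {0: 68, 1: 33, 2: 11, 3: 2}

def table_mean_and_sd(freq):
    """Number of cases, mean, and standard deviation of a frequency table.

    The classes are assumed to be numbered 0, 1, 2, ... as the fitting
    procedure requires.
    """
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n
    return n, mean, math.sqrt(variance)

N, A, sd = table_mean_and_sd(years_by_vacancies)
print(f"N = {N}, A = {A:.3f}, sd = {sd:.3f}, sqrt(A) = {math.sqrt(A):.3f}")
```

If the standard deviation and the square root of the average came out far apart, that would be a warning that a Poisson curve is unlikely to fit.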

If we write the equation as it appears when we insert the values 
for our present problem, we get 

$$y = 114 \, \frac{0.535^x \left(2.71828^{-0.535}\right)}{x!}$$

It will be easier to solve the equation if we put it in logarithmic 
form, thus 

$$\log y = \log 114 + x(\log 0.535) - 0.535(\log 2.71828) - \log x!$$

Substituting the logarithms of the values given we get

$$\begin{aligned}
\log y &= 2.05690 - 0.27165x - 0.23235 - \log x! \\
       &= 1.82455 - 0.27165x - \log x!
\end{aligned}$$

Remembering that x represents the various numbers of vacancies 
in the Supreme Court, we now give x successively the values of 
0, 1, 2, and 3 and solve our equation to find y, the frequency of 
occurrence. If first we let x equal 0, and recall that by convention
factorial zero is taken as 1 (that is, when x = 0, x! = 1) we
get

$$\log y = 1.82455 \quad \text{and} \quad y = 66.77$$

In other words, we would have expected to have no vacancies in
66.77 years out of every 114.

If, now, we let x = 3 our equation becomes

$$\begin{aligned}
\log y &= 1.82455 - (0.27165)(3) - \log 3! \\
       &= 1.82455 - 0.81495 - \log 6 \\
       &= 0.23145 \\
     y &= 1.70
\end{aligned}$$
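
The same logarithmic routine can be sketched in Python, using common logarithms as in the hand computation. The function name is an assumption made for illustration; because the machine carries more places than a five-place table, its results may differ slightly from hand-computed figures in the second decimal.

```python
import math

def poisson_frequency_by_logs(x, N, A):
    """Estimated frequency y = N * A**x * e**(-A) / x!, found through
    common logarithms in the same order as the hand computation."""
    log_y = (math.log10(N)
             + x * math.log10(A)
             - A * math.log10(math.e)
             - math.log10(math.factorial(x)))
    return 10 ** log_y

N, A = 114, 0.535  # values used in the text
for x in range(4):
    print(x, round(poisson_frequency_by_logs(x, N, A), 2))
```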


from these substituted values. To make this clear we use here the letter A 
to represent the average so computed rather than our more usual symbol for 
the average.

Computing similarly the values of y which occur when x is 1 and
when x is 2, we get the following results:

Number of        Number of Occurrences
Vacancies        Estimated      Actual
    0              66.77          68
    1              35.72          33
    2               9.54          11
    3               1.70           2
  Totals          113.63         114

It will be seen that the Poisson distribution follows the original
data very closely.

If one is interested in the relative likelihood that various numbers
of vacancies will occur rather than in the actual numbers
of years out of 114 in which each will occur (that is, if one wants
relative probabilities rather than actual frequencies), one will
omit the N in the formula. The remainder of the formula is then
solved just as we have solved it, and it yields the desired probabilities.
In our present case, trial will disclose that the probabilities
are approximately as follows:

Number of        Probability
Vacancies        of Occurrence
    0               0.586
    1               0.313
    2               0.084
    3               0.015
  Total             0.998


The probability that more than three vacancies will occur in a 
year is found by subtracting this total, 0.998, from 1.000. In 
this case we find that this probability is 0.002. In other words, 
in two years out of every 1000 we might expect to find more than 
three vacancies occurring in the Supreme Court if the general 
underlying causal factors continue to operate as they did operate 
in the period studied. 
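
These probabilities, and the small remainder left over for more than three vacancies, can be checked with a brief Python sketch. The function name is an assumption made for illustration; the average of 0.535 is the one used in the text.

```python
import math

def poisson_probability(x, A):
    """Probability of exactly x occurrences: A**x * e**(-A) / x!."""
    return A ** x * math.exp(-A) / math.factorial(x)

A = 0.535  # average number of vacancies per year, from the text
probs = [poisson_probability(x, A) for x in range(4)]
tail = 1.0 - sum(probs)  # probability of more than three vacancies

for x, p in enumerate(probs):
    print(f"{x}: {p:.3f}")
print(f"more than 3: {tail:.3f}")
```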

Although the work of fitting a Poisson distribution is very 
simple by means of logarithms, it can be speeded up somewhat 
if we use prepared tables. These tables sometimes show the 
proportion of the cases in each class for any value of the arithmetic
mean, so that after we have found the arithmetic mean all
we have to do is look in the table to find what percentage of the 
cases will be in class 0, what percentage in class 1, etc. 1 Other 
tables merely show the proportion of cases in class 0 of a Poisson 
distribution with any given arithmetic mean, leaving the worker 
to compute the proportion in other classes from the following 
simple relationship. 2 

To find the proportion of cases in class 1 of a Poisson distribution,
multiply the proportion in class 0 by the arithmetic mean.
To find the proportion in class 2, multiply the proportion in class
1 by one-half the arithmetic mean. To find the proportion in
class 3, multiply the proportion in class 2 by one-third the arithmetic
mean. In general, to find the proportion of the cases in
class n, multiply the proportion in class n − 1 by 1/nth the
arithmetic mean. 3 We see at once from these directions that if
the arithmetic mean is smaller than 1, the number of cases in each
class will be smaller than that in the preceding class. If the
value of the arithmetic mean exceeds 1, class 1 will be larger than
class 0; if the arithmetic mean exceeds 2, the second class will
be larger than the first, etc. Thus the Poisson distribution may
be mound-shaped, but will be J-shaped when A is less than 1.
Also we see that the Poisson distribution approaches the normal
distribution as a limit when the value of A gets very large. We
have seen at the beginning of this section that in a Poisson distribution
$\alpha_3 = 1/\sqrt{\bar{X}}$, and as the arithmetic mean gets very
large this will make $\alpha_3$ approach zero. This indicates that the
curve is becoming more and more symmetrical. Likewise, we
saw that in such a distribution $\alpha_4 = 3 + 1/\bar{X}$. As the arithmetic
mean increases, this value will approach closer and closer
to 3, so that when the arithmetic mean of a Poisson distribution is
large, $\alpha_3$ approaches 0 and $\alpha_4$ approaches 3, which means that
the distribution approaches the normal curve. The student
must remember in fitting these distributions to compute the
arithmetic mean in the special way indicated, with the first class
given a value of zero, the next a value of 1, etc.

1 For example, Karl Pearson, “Tables for Statisticians and Biometricians,”
pp. 113-124, Cambridge University Press, London, 1914, gives the
proportion of cases in each class of a Poisson series for various values of the
arithmetic mean, by steps of one-tenth of a unit in the arithmetic mean.
For example, it would give data for cases when the mean was 0.6 or 0.7,
but not for 0.63.

2 For example, in the Laboratory Manual, which is designed for use with
this text, we find immediately that when A is 1/2, the proportion of cases in
class 0 is 60.65 per cent. This table gives values of the arithmetic mean
to the nearest hundredth rather than merely to the nearest tenth. See
Waugh, op. cit., Table A 22.

3 The student who is interested in mathematics will see the reason for this
rule at once. The formula for the proportion of cases in any class of a
Poisson distribution is

$$\frac{A^x e^{-A}}{x!}$$

For class 0 this would give

$$\frac{A^0 e^{-A}}{0!} = \frac{e^{-A}}{1}$$

Similarly, for class 1 we get

$$\frac{A e^{-A}}{1}$$

For class 2 we get

$$\frac{A^2 e^{-A}}{(1)(2)}$$

The proportion in each class is found by multiplying the proportion in the
preceding class by A/n.
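
The rule of proportion, and what it implies about the shape of the distribution, can be tried out in a short Python sketch. The function names are assumptions made for illustration; 0.535 is the Supreme Court average, while 2.5 and 30.5 are assumed averages chosen only to show a mound-shaped case and a nearly normal one.

```python
import math

def poisson_proportions(A, classes):
    """Proportions in classes 0 .. classes-1, built up from class 0 by
    the rule: proportion(n) = proportion(n - 1) * A / n."""
    props = [math.exp(-A)]  # class 0
    for n in range(1, classes):
        props.append(props[-1] * A / n)
    return props

def largest_class(A, classes=100):
    """Number of the class holding the largest proportion of cases."""
    props = poisson_proportions(A, classes)
    return props.index(max(props))

for A in (0.535, 2.5, 30.5):
    alpha3 = 1 / math.sqrt(A)
    alpha4 = 3 + 1 / A
    print(f"A = {A}: largest class = {largest_class(A)}, "
          f"alpha_3 = {alpha3:.3f}, alpha_4 = {alpha4:.3f}")
```

When A is below 1 the largest class is class 0, which is the J-shaped case; as A grows the peak moves to the right while $\alpha_3$ falls toward 0 and $\alpha_4$ toward 3.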

8.12. Goodness of Fit and the Chi-square Test.—During this 
and the preceding chapter, we have been computing various 
sorts of curves to represent frequency distributions—normal 
curves, Type III curves, Poisson curves, etc. We may well ask 
with regard to any one of these curves just how well or how poorly 
it describes the distribution to which we fit it. When we wish to
compare actual data with hypothetical data, to see whether or
not the hypothesis is reasonable in the light of the actual data,
one of the most useful methods involves the application of what
is called the chi-square test, or the $\chi^2$ test. When we apply
this test, we are always making a comparison between an actual
occurrence and a hypothetical occurrence. For example, we
might say that on the basis of theory we should expect a penny
to come up “heads” 28 times in 56 throws. If it did actually
come up “heads” 35 times, we should compare the actual occurrence
(35 heads) with the expected or theoretical occurrence
(28 heads), and we should try to decide whether the actual differed
from the expected enough to force us to abandon the
hypothesis that the penny could be expected to come up “heads”
half the time.

In the case of frequency curves, our usual hypothesis is that
the data are fundamentally normal or that they are fundamentally
of the skewed Type III form, or that they are fundamentally
of the Poisson type, etc. We realize that any finite number
of cases, say 1000 cases, may not conform exactly to the hypothesis,
just as in 1000 throws of a penny we might not get exactly
500 heads. But we are interested in comparing what we did get 
with what we would have got if the hypothesis had been correct. 

Perhaps the idea will be simpler if we illustrate it with data 
that we have already studied. In the preceding chapter, in Sec. 
7.8, we fitted a normal curve to data describing the distribution 
of students’ heights. Evidently it was then our hypothesis that 
the heights of students were normally distributed. When we 
got through, we found that the students’ heights had not been 
distributed exactly normally, but that there were differences 
between the actual figures and those which would have been 
expected on the basis of our hypothesis that the data were normal. 
Our findings are given in the first three columns of Table 8.6.
These are the first, second, and last columns of Table 7.5. 

In applying the chi-square test, if we find some classes with 
very few cases (as we often do near the ends of a mound-shaped 
distribution) we lump two or three classes together. The reason 
for this is obvious. If I want to know whether a penny is well- 
balanced or not, I know better than to try to decide on the basis 
of only two or three tosses. After all, if I toss a penny only 
once it will come up heads or tails 100 per cent of the time, not 
50 per cent of the time. Similarly, if some classes have less than 
10 cases it is considered good practice to lump enough classes 
together so that there will be at least 10 cases in each group. 
Thus in Table 8.6 we have lumped together the top two classes, 
giving us 12 cases, and the bottom four classes, giving us 28 
cases. We have then, in the fourth column, compared the accuracy
of our estimates by subtracting the computed from the
actual values. These figures in the fourth column show us the 
amounts of our errors. For example, if the distribution of 
heights had actually been exactly normal, there would have been 
141 cases in the 180 class, whereas there were but 125. Thus the 
error was 16 cases; −16 because the actual distribution was 16
cases short. 

Table 8.6.—Applying the Chi-square Test to the Distribution of
Student Heights

Class     Observed     Expected
Mark      Frequency    Frequency
(X)       (f)          (f')          (f − f')    (f − f')²    (f − f')²/f'

 156         4           2.5 }
 159         8           8.6 }          0.9         0.81        0.073
 162        26          23.6            2.4         5.76        0.244
 165        53          53.7           −0.7         0.49        0.009
 168        89          97.9           −8.9        79.21        0.799
 171       146         146.8           −0.8         0.64        0.004
 174       188         177.7           10.3       106.09        0.597
 177       181         175.3            5.7        32.49        0.185
 180       125         141.0          −16.0       256.00        1.816
 183        92          91.4            0.6         0.36        0.004
 186        60          48.8           11.2       125.44        2.570
 189        22          20.9 }
 192         4           7.4 }         −2.9         8.41        0.272
 195         1           2.1 }
 198         1           0.5 }
 Totals   1000         998.2                                    6.573

It would be unwise to judge the amount of our error by adding
these errors of column 4 and using the sum, since every error
might be large, yet the positive and negative signs might balance,
giving a small sum. Therefore we use the same procedure that
we used under similar circumstances when computing the standard
deviation; that is, we square the deviations. This gives
us the fifth column. Yet even here we do not have a good measure,
because surely we should feel that to come within 10 heads
of the expected amount when we tossed a penny a million times
was very close, while to miss it by 10 heads when we tossed the
penny 22 times would be very far from expectation. We should
naturally want to divide by the expected amount to get the
result in percentage form. This is what we do in the last column
of Table 8.6. Each figure in this column is found by dividing
the corresponding figure in the preceding column by the corresponding
figure in the third column. When we add these figures
in the last column, we get the value of $\chi^2$, which in this case is
6.573.
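
The arithmetic of Table 8.6 can be sketched in Python as follows. The variable and function names are assumptions made for illustration. Because the machine divides without rounding each quotient to three places, its total should land within about 0.01 of the 6.573 shown in the table rather than agree with it exactly.

```python
# Observed and expected frequencies from Table 8.6, with the two
# classes at the top and the four at the bottom lumped together.
observed = [4 + 8, 26, 53, 89, 146, 188, 181, 125, 92, 60, 22 + 4 + 1 + 1]
expected = [2.5 + 8.6, 23.6, 53.7, 97.9, 146.8, 177.7, 175.3, 141.0,
            91.4, 48.8, 20.9 + 7.4 + 2.1 + 0.5]

def chi_square(obs, exp):
    """Sum of (observed - expected)**2 / expected over all classes."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(obs, exp))

chi2 = chi_square(observed, expected)
print(f"classes used = {len(observed)}, chi-square = {chi2:.2f}")
```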

To interpret this answer, we have to know how large a value of
$\chi^2$ can be expected. Obviously this will depend on the number
of items in the column. In our illustrative case there are 11
items in the column. But also in our illustrative case we were
not content with any random normal curve. On the contrary,
we insisted on fitting a normal curve that had the same total number
of cases, the same arithmetic mean, and the same standard
deviation as our original figures. Thus we forced our normal
curve to agree with our original data in three particulars, and we
say that we reduced the number of “degrees of freedom” by 3,
from the original 11 items to 8. Therefore we let n' equal 8 for
our particular example. 1 Let us now consult Fig. 8.4, finding
our value of chi square (6.573) on the vertical axis at the left
and our value of n' (which is 8) on the base line. We discover
that the point on this diagram at the right of $\chi^2 = 6.573$ and
directly above n' = 8 lies between the line labeled P = 0.50
and the one labeled P = 0.70. In fact, if we interpolate roughly
between these two lines, we might say that this point represents
a value of 0.58, since it seems to be just barely closer to the 0.50
line than to the 0.70 line. 2 This is the probability that we should
get a fit as bad as or worse than the fit of Table 8.6 by pure chance
in drawing a sample of 1000 cases from a universe which was
really normally distributed. To put this in other terms, since
our value of P is 0.58 in this problem, we can say that 58 per
cent of the time we should get fits as bad as this one or worse
just by chance even if students’ heights in the universe are really
normally distributed. This, then, is a reasonably good fit; it
is by no means unreasonable to assume that the distribution of
heights is really normal, and that the departures from normality
among our particular 1000 students are merely chance, haphazard
variations.

Fig. 8.4. Values of P corresponding to various values of chi square and
various degrees of freedom.

1 In applying the chi-square test, we always compare a number of actual
frequencies (here 11 of them) with the same number of theoretical frequencies.
The number of degrees of freedom, n', is the number of classes the
frequencies of which could be filled in at random without violating any of the
totals, subtotals, etc. For example, if we take the case of Table 8.6, we want
to make the total frequency equal 1000 to correspond with our original
data. We could put numbers arbitrarily into any 10 of the 11 classes, but
having filled up those 10 classes there is only one number that can be put
in the 11th class to give us the right total. We do not have any “freedom”
in making our last entry. There are but 10 “degrees of freedom.” And
when we stipulate that the mean must be 175.335 cm. and that the standard
deviation must be 6.582 cm., we lose two more degrees of freedom, leaving
us n' = 8. In general, when we fit normal curves, n' is smaller by 3 than the
number of classes. When we fit a Type III curve, we must also use the
skewness; so we lose another degree of freedom, and n' is smaller by 4 than
the number of classes. When we fit a Poisson curve, we use only the total
frequency and the arithmetic mean; so n' is smaller by 2 than the number of
classes. For a complete understanding of degrees of freedom, the student
will find it necessary to consult some more advanced work, but these rules of
thumb will be reasonably adequate for the types of work here described.

2 The value of P corresponding to any values of n' and $\chi^2$ can be computed,
but the procedure is laborious. The results have fortunately been tabulated,
as, for example, in Pearson, op. cit., p. 26, and in Fisher and Yates,
“Statistical Tables for Biological, Agricultural, and Medical Research,”
p. 27, Oliver & Boyd, Edinburgh and London, 1938. A copy of the latter
table appears in Waugh, op. cit. Linear interpolation in such tables gives
0.585 for the value of P, but one can make a visual interpolation on Figs. 8.4
and 8.5 with sufficient accuracy, and we shall continue here to use the rough
value of 0.58.

If our value of P turns out to be very small, it means that our 
fit was very bad. Suppose, for example, that we had a case 
where $\chi^2$ was 24 and n' was 3. We see from the chart that we
should get a fit as bad as this far less than one time in a thousand 
by chance. If we look up the values in the tables, we find that 
such a poor fit would occur by chance only about 25 times in 
a million. In other words, if we had a very large number of 
students, and kept on trying again and again taking samples of 
1000 students each, the samples would not always be exactly 
normal even if the larger group from which they were selected 
was exactly normal—yet only 25 times in a million trials would 
any sample drawn at random differ so much from normal as this. 
Now if the chances are only 25 in a million that this sample came 
from a normal distribution, we are fairly safe in assuming that 
the actual distribution is not normal. When P is very small, 
we decide that our hypothesis is not tenable. Our hypothesis 
in this case was that the distribution was normal. We actually 
did get a value of P which led us to believe that the heights might 
well be distributed normally (P = 0.58).
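
Where neither chart nor table is at hand, the value of P can be computed directly. The Python sketch below is one way of doing so under stated assumptions: the function name is hypothetical, the number of degrees of freedom must be a positive whole number, and the method (a standard recurrence that steps up two degrees of freedom at a time from the closed forms for one and two degrees of freedom) is not the procedure behind the charts in the text.

```python
import math

def chi_square_p(chi2, df):
    """Probability of a chi-square value at least this large by chance,
    for a positive whole number of degrees of freedom."""
    z = chi2 / 2.0
    if df % 2 == 0:
        a, p = 1.0, math.exp(-z)               # closed form for df = 2
    else:
        a, p = 0.5, math.erfc(math.sqrt(z))    # closed form for df = 1
    # Each pass adds two degrees of freedom:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    while a < df / 2.0:
        p += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return p

print(round(chi_square_p(6.573, 8), 3))  # the student-heights example
print(chi_square_p(24.0, 3))             # the case of chi-square 24, n' = 3
```

The first figure should fall close to the 0.58 read from the chart, and the second close to 25 in a million.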

It might be wise to point out here that we can also get a value 
of P which is so large that we look upon it with suspicion. We 
know that when we toss two good pennies the chances are that 
half of them will be heads and half tails. Yet if someone tosses 
two pennies time after time, and always gets one head and one 
tail (never getting two heads or two tails), we decide that there 
must be something wrong. It is “too good to be true.” If 
you saw a man toss two pennies 500 times, and every single time 
he got one head and one tail, you would (or should) raise some 
question about it in your mind. Similarly, if every single class 
in a frequency table has exactly the expected number of cases, 
or almost exactly the expected number, we are suspicious. Thus 
if we got values of $\chi^2$ and n' which yielded a value of P = 0.999,
this would mean that we should get a better fit only once in a 
thousand by chance. Really what we look for in applying the 
chi-square test is values of P somewhere around 0.5. Some 
departure from this is common and raises no question in our
minds. Usually if P is between 0.1 and 0.9, we accept it as
meaning that our hypothesis is probably tenable. When we get 
outside these bounds, we begin, perhaps, to get the least bit 
suspicious of our hypothesis, and we prefer to try more cases to 
make sure. But we do not usually throw our hypothesis out 
altogether unless P is smaller than 0.01, and many statisticians 
would insist on using an even smaller value of P. Neither do 
we decide that the data are altogether too close to expectation 
to believe that it happened by chance unless P is greater than 
0.99. 
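
These working bounds can be set down as a small Python helper. The function name and the wording of the four verdicts are assumptions made for illustration; only the bounds 0.01, 0.1, 0.9, and 0.99 come from the text, and, as noted above, many statisticians would insist on a smaller lower bound.

```python
def interpret_p(P):
    """Rough reading of P, following the bounds suggested in the text."""
    if not 0.0 <= P <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    if P < 0.01:
        return "reject: fit too poor to ascribe to chance"
    if P > 0.99:
        return "suspect: fit too close to ascribe to chance"
    if P < 0.1 or P > 0.9:
        return "doubtful: try more cases"
    return "tenable: no reason to abandon the hypothesis"

for P in (0.58, 0.000025, 0.999, 0.05):
    print(P, "->", interpret_p(P))
```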

Let us apply the chi-square test to one other set of data and 
interpret the results. In the preceding section, we fitted a 
Poisson distribution to data on Supreme Court vacancies (see 
page 219). The data appear again in Table 8.7. This time
we have lumped together the last two classes to get at least 10
cases in each class. Again we find in the fourth column the
differences between actual and expected, and in the fifth column
we find the squares of these differences. Then we divide each
item of column 5 by the corresponding figure in column 3 to get
column 6. The sum of this last column is chi square. In this
Supreme Court problem, $\chi^2 = 0.506$. When we look at Fig. 8.4,
we find it hard to tell what the value of P is with very much
accuracy when n' is 1 and $\chi^2$ is as small as 0.506, so we turn to
Fig. 8.5. This latter figure is merely the lower left-hand corner
of Fig. 8.4 enlarged. In Fig. 8.5, we locate the point on the
diagram which is directly at the right of the value 0.506 on the
vertical scale and directly above the figure 1 on the horizontal
scale. We see that this point falls almost exactly on the 0.50
line. Therefore we know that P has a value of approximately
0.50. This tells us that if Supreme Court vacancies were really
distributed in a Poisson distribution, and if we selected cases at
random, we should get results that fitted worse than these about
50 per cent of the time. The Poisson curve, then, gives a reasonable
fit to the data of Table 8.7, just about what we might
expect to get in samples from a distribution that was really an
exact Poisson distribution.

Table 8.7.—Applying the Chi-square Test to a Poisson Distribution

  X         f         f'         (f − f')    (f − f')²    (f − f')²/f'

  0        68        66.77         1.23         1.51         0.023
  1        33        35.72        −2.72         7.40         0.207
  2        11         9.54 }
  3         2         1.70 }       1.76         3.10         0.276
Totals    114       113.63                                   0.506
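
The figures of Table 8.7 can be checked with a short Python sketch. The variable names are assumptions made for illustration, and the expression used for P is a shortcut that holds only when there is a single degree of freedom. Computed without intermediate rounding, chi square comes out very slightly below the 0.506 of the table, and P falls a little under the 0.50 read from the chart.

```python
import math

# Table 8.7, with the last two classes (2 and 3 vacancies) lumped together.
observed = [68, 33, 11 + 2]
expected = [66.77, 35.72, 9.54 + 1.70]

chi2 = sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

# A Poisson fit uses the total frequency and the mean, so n' = classes - 2.
degrees_of_freedom = len(observed) - 2

# With one degree of freedom the chi-square tail probability reduces to
# erfc(sqrt(chi2 / 2)); this form is not valid for other values of n'.
P = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.3f}, n' = {degrees_of_freedom}, P = {P:.2f}")
```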



Fig. 8.5. Values of P corresponding to various values of chi square and 
various degrees of freedom. 


We can now summarize the rules for applying the chi-square test to a 
frequency distribution as follows: 

1. Set down in a column the actual frequencies of the classes in the 
frequency table. 

2. Set down beside them the corresponding frequencies that would be 
expected if the distribution were normal, skewed Type III, Poisson, or
whatever your hypothesis calls for. You now have one column of actual
frequencies and one column of computed or estimated frequencies. 

3. If any of the classes of actual frequencies contain less than 10 cases, 
add them to adjacent classes until no class contains less than 10 cases. 

4. Subtract each computed frequency from the corresponding actual 
frequency. 

5. Square the differences just obtained. 

6. Divide each of these squares by the corresponding computed frequency.

7. Add the quotients just obtained. The sum is $\chi^2$.

8. Find the number of degrees of freedom by subtracting from the 
number of classes actually used (that is, from the number of entries in the 
last column of your table) the following number: 2 for a Poisson series, 
3 for a normal curve, or 4 for a skewed Type III curve. The remainder 
is n', or the number of degrees of freedom. 

9. In Fig. 8.4 or 8.5, find the point that lies vertically above your value
of n' found in step 8 and at the right of the value of $\chi^2$ found in step 7. Read
from the lines in the figure the approximate value of P. This value of P
may also be found from prepared tables appearing in various manuals for 
statisticians. 

10. The value of P found in the preceding step is the probability that you 
would get by pure chance a worse fit than you did if your hypothesis had 
been correct. If the value of P is very small, it means that your hypothesis 
is probably incorrect. If the value of P is very large, it means that the
data are suspiciously close to those expected, and that they have probably 
been computed rather than observed, or in some way adjusted to get them so 
close to expectation. The value of P can be as large as 1.00 and as small 
as 0.00. Values between about 0.10 and 0.90 should lead you to believe 
that there is no reason for abandoning your original hypothesis. 
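
Steps 1 through 8 of this summary can be sketched in Python as follows. All names are assumptions made for illustration, and the lumping routine is only one simple way of carrying out step 3 (it works from the first class downward and then folds a small final group back into its neighbor); the text does not prescribe a particular order of lumping. Steps 9 and 10, the reading and interpretation of P, are left to the chart or tables.

```python
def lump_small_classes(observed, expected, minimum=10):
    """Merge adjacent classes until every observed frequency >= minimum."""
    groups = []  # each entry is [observed_total, expected_total]
    for f, fe in zip(observed, expected):
        if groups and groups[-1][0] < minimum:
            groups[-1][0] += f
            groups[-1][1] += fe
        else:
            groups.append([f, fe])
    # A final group that is still too small joins the group before it.
    while len(groups) > 1 and groups[-1][0] < minimum:
        f, fe = groups.pop()
        groups[-1][0] += f
        groups[-1][1] += fe
    return groups

# Number subtracted from the number of classes used (step 8).
FITTED_CONSTANTS = {"poisson": 2, "normal": 3, "type III": 4}

def chi_square_test(observed, expected, curve, minimum=10):
    """Return (chi-square, degrees of freedom) for a fitted curve."""
    groups = lump_small_classes(observed, expected, minimum)
    chi2 = sum((f - fe) ** 2 / fe for f, fe in groups)
    return chi2, len(groups) - FITTED_CONSTANTS[curve]

# Supreme Court vacancies, unlumped, as a usage example.
chi2, df = chi_square_test([68, 33, 11, 2], [66.77, 35.72, 9.54, 1.70],
                           "poisson")
print(f"chi-square = {chi2:.3f}, degrees of freedom = {df}")
```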

While the chi-square test is, in practice, ordinarily applied in accordance 
with our illustrations, there are some theoretical advantages where the 
number of cases is very large and the data are continuous in classifying the
original data in a frequency table with unequal class intervals so chosen that
the frequencies in the various classes are equal. 1 This method requires the
use of many more classes than are ordinarily needed for other purposes,
running often to 50 or more; and since unequal class intervals have serious
disadvantages in other ways (see Chap. III) the statistician will be apt to
utilize his data in the more familiar form, with equal class intervals and 
unequal class frequencies. 

1 See H. B. Mann and A. Wald, On the Choice of the Number of Intervals
in the Application of the Chi-square Test, Annals of Mathematical Statistics,
September, 1942; and C. Arthur Williams, Jr., On the Choice of the Number
and Width of Classes for the Chi-square Test of Goodness of Fit, Journal
of the American Statistical Association, Vol. 45, No. 249, March, 1950, pp.
77-86.

8.13. Suggestions for Further Reading.—The student who wishes to
learn more about moments would do well to read Chap. IV of Part I, John
F. Kenney, “Mathematics of Statistics,” D. Van Nostrand Company, Inc.,
1939; or Chap. 9 in G. Udny Yule and M. G. Kendall, “An Introduction
to the Theory of Statistics,” Charles Griffin & Co., Ltd., London, 1937.
A discussion of Sheppard's corrections can be found in H. C. Carver's
article on Frequency Curves which is Chap. VII of the “Handbook of
Mathematical Statistics,” edited by H. L. Rietz, Houghton Mifflin Company,
Boston, 1924. Pearson's system of frequency curves, including both
the Type III and other types, is discussed in Chap. IV of W. Palin Elderton,
“Frequency Curves and Correlation,” Layton, London, 1927. A much
shorter and simpler treatment is found in Chap. III of C. B. Davenport and
Merle P. Ekas, “Statistical Methods in Biology, Medicine and Psychology,”
John Wiley & Sons, Inc., New York, 1936. References to further discussions
of the Poisson curve are given in the footnote on page 215. To these we
might add Henry Lewis Rietz, “Mathematical Statistics,” pp. 39-45, The
Open Court Publishing Company, La Salle, Ill., 1927. The chi-square test
was originated by Karl Pearson, and the student who wishes to investigate
it further would do well to read his original article, On the Criterion That
a Given System of Deviations from the Probable in the Case of Correlated
Variables Is Such That It Can Reasonably Be Supposed to Have Arisen
from Random Sampling, Philosophical Magazine, 5th series, Vol. 50, 1900,
pp. 157ff. Also helpful is R. A. Fisher, “Statistical Methods for Research
Workers,” Chap. IV, 3d ed., Oliver & Boyd, Edinburgh and London, 1930.
For a discussion of the use of another kind of probability paper see F. C.
Martin and D. H. Leavens, A New Grid for Fitting a Normal Probability
Curve to a Given Frequency Distribution, Journal of the American Statistical
Association, Vol. 26, new series No. 174, June, 1931, pp. 178ff.

EXERCISES 

1. Compute the first four moments of the data of Table 5.9, page 124. 

2. Table 8.8 shows the number of Kansas towns having various numbers
of cream stations. 1 The distribution is obviously skewed. Determine by
inspection whether the skewness is positive or negative. Verify by computing
each of the measures of skewness described in Sec. 8.6.

3. Fit a Poisson curve to the data of Table 8.8. 

4. Test the goodness of fit by chi square for your results in Exercise 3. 
Interpret your results. Does a Poisson curve give a good fit whenever one 
has a J-shaped distribution? 

5. Fit a normal curve to the data of Table 8.5, page 214. Apply the 
chi-square test, and interpret your results. 

6. Apply the chi-square test to the data of Table 8.5, page 214, using the
Type III curve. Interpret your results. 

7. Comparing your results on Exercises 5 and 6, which fits the data of 
Table 8.5 better, a normal curve or a skewed Type III curve? 

8. Apply the Charlier check to your computations of Exercise 1.

9. Apply Sheppard’s corrections to your computations of Exercise 1. 

1 From Theodore Macklin, “Efficient Marketing for Agriculture,”
p. 346, The Macmillan Company, New York, 1922. 





10. Fit a Type III curve to the data of Table 5.9, page 124. It will be
necessary to use prepared tables of ordinates of the Type III curve. 

11. Describe as well as you can without having seen it the characteristics 
of a frequency distribution which has the following frequency statistics: 

$\bar{X} = 17.5$
$\sigma = 2.7$
$\alpha_3 = -0.9$
$\alpha_4 = 3.2$

12. Explain why it is that we are suspicious of very large values of P in a 
chi-square problem; that is, what sort of things other than chance might 
account for a value of P of 0.9996? 

Table 8.8.—Numbers of Kansas Towns Having Various Numbers of
Cream Stations

Number of Cream Stations    Number of Towns
           1                      282
           2                      240
           3                      151
           4                      101
           5                       50
           6                       11
           7                        8
           8                        2
           9                        1

13. In applying the chi-square test, does any value of P tell us that our 
original hypothesis was correct? What will be true of the value of P if 
our original hypothesis was correct? 

14. On page 219 we computed various numbers of Supreme Court vacan¬ 
cies, using the formula for the Poisson curve. Later we found that, after 
computing the first of these, the others could have been computed easily 
by proportion. Check the results of page 219 by using the proportions of 
page 220. 

15. Consider the following five values: 1, 2, 3, 4, 5. They are obviously
distributed symmetrically. Compute the first, third, and fifth moments of
this distribution and verify the statement made on page 190 that the odd
moments of symmetrical distributions equal zero. Compute the second
and fourth moments and discover why they do not equal zero. Use the
formulas given on page 188.

16. Why is it true, as stated on page 192, that $\alpha_1 = 0$ and
$\alpha_2 = 1$ in every distribution?

17. The statement is made on page 203 that Bowley’s measure of
skewness cannot be greater than +1 or less than -1. Under what circumstances
would it equal +1? -1? 0?

18. Compute Sk. for the heights of Harvard students, using Kelley’s
method based on percentiles (page 204).

19. Find the mode of the data of Table 5.9, page 124, by the method
described on page 205.

20. In Sec. 8.5, we evaluated the terms of the point binomial $(q + p)^8$
when p was equal to 0.3. The resulting skewed binomial appears at the
lower left of Fig. 8.1. Evaluate the terms of the point binomial $(q + p)^8$
when p = 0.2. Plot your results, and compare them with the appropriate
section of Fig. 8.1.

21. Suppose that you have fitted some sort of frequency curve to the
data of a frequency table. The original table showed the data divided
into 12 classes. You compare your computed frequencies with the actual
frequencies, and apply the chi-square test. You discover that $\chi^2 = 15$.
Interpret this result, using either a table of $\chi^2$ or the charts of Figs. 8.4
and 8.5.



CHAPTER IX 


MEASURES OF RELIABILITY 

Occasionally a statistician works on a problem of such a nature 
that he can study all the existing facts—all the data are at his 
command. For example, if we wish to ascertain the average 
length of the terms of past Presidents of the United States, we 
can get data on each and every man who ever occupied the Presidential
chair. Thus we can be sure that our average represents
the facts (at least if the original data were accurate and if we
made no mistakes in computation). 

But suppose that we wish to discover the average yield per 
acre of potatoes in Maine, the average height of college students 
in the United States, the average weight of male babies at birth, 
or the average temperature on Aug. 7 in Duluth. The problem 
is then somewhat different. The chances are that we cannot get 
figures on every acre of potatoes in Maine or on every college 
student in the United States or on every male baby born or on 
every August temperature back to the beginning of time. Some 
figures we can usually obtain; but almost never can the statis¬ 
tician get figures on every occurrence of the event. A complete 
enumeration may be too expensive, or it may be entirely impos¬ 
sible regardless of expense, as it would be in the case of Duluth 
temperatures. 

9.1. Sample and Universe.—When it is impossible to get complete
data (and it almost never is possible), the statistician finds it
necessary to fall back on a sample. The entire body of data
which describe every occurrence of the event which ever existed 
is called the universe. For example, if we take the problem of 
determining the average birth weight of male babies, the universe 
would consist of the weights of all babies who were ever born 
(that is, of all male babies, or of all babies of the kind in which 
we were interested). The sample would consist only of those 
weights of which we actually had records. In such a study it 
is probable that the sample would be relatively small as compared
with the size of the universe; that is, the figures usually given
on average weights of babies are based on observations of a number
of births which is small when compared with the total number
of births.

What is actually done, of course, is to weigh a relatively small 
number of babies (a few thousand at the most) and to call their
average weight the average weight of babies. Similarly we take
temperature records in Duluth for a relatively few years (surely 
less than a century) and call the average temperature for these 
years the average Duluth temperature. We find the yield of 
potatoes on each of a few hundred acres in Maine and call the 
average of these figures the average yield per acre in Maine. 
Thus the statistician studies the characteristics of a sample, and
then imputes the same characteristics to the universe. We study
the heights of 1000 college students. We find the average height
of these students, the dispersion in their heights, the skewness 
in their heights, etc. We then ascribe the same average height, 
the same dispersion of heights, and the same skewness of heights 
to students in general. 

This habit of studying the peculiarities of a sample and attrib¬ 
uting the same peculiarities to the universe seems all the more 
peculiar when we stop to realize that, if we were to take another 
group of 1000 students selected at random, they would almost 
certainly not have exactly the same average height as did the 
first group measured. If we were to take 1000 such groups of 
students (1000 groups of 1000, each selected at random) we should 
find variations in the averages of the groups. Some groups would 
have higher average heights; some would have lower. Likewise
in some groups the standard deviation of heights would be larger 
than in others, and each of the other measures by which we 
describe the group of heights would vary from sample to sample. 
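
The sample-to-sample variation just described can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "universe" is a hypothetical set of 100,000 normally distributed heights generated by the computer (the mean of 175.0 cm and standard deviation of 6.5 cm are invented for illustration and are not the Harvard data), 200 samples are drawn instead of 1000 to keep the run short, and the function name is ours.

```python
import random
import statistics

def sample_means(universe, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(universe, sample_size))
            for _ in range(n_samples)]

# Hypothetical universe of heights in cm (an assumption, not real data).
_rng = random.Random(1)
universe = [_rng.gauss(175.0, 6.5) for _ in range(100_000)]

means = sample_means(universe, sample_size=1000, n_samples=200)
print("lowest sample mean: ", round(min(means), 3))
print("highest sample mean:", round(max(means), 3))
print("std. dev. of means: ", round(statistics.pstdev(means), 3))
```

No two runs with different seeds need agree exactly, which is the point: each sample gives a slightly different mean, yet the means cluster closely around the mean of the universe.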

We are usually interested in the characteristics of the universe. 
We study past potato prices mainly because we are interested in 
future potato prices. We study past scholastic records primarily 
because we want to know what to expect in the future. As 
Professor Frank Knight has pointed out, the scientist who dissects 
a dead dog does it because he is interested in live dogs—he cuts 
open a dog not because he is interested in that dog but because he 
is interested in the universe of dogs (or the universe of mammals 
or in life itself). Now if we must expect to find changes in our
answers whenever we change our sample, what faith can be put
in our conclusions? How can we ascribe to the universe the
characteristics of a sample when we are certain that other samples
would have yielded other characteristics? How do we dare say
that the average male baby weighs 7.6 lb. at birth when we have
weighed but a few thousand babies and when we realize that the 
average weight of a few thousand other babies would probably be 
somewhat different? 

Let us start by noting that if we know the facts about a uni¬ 
verse we can tell something about the samples which could be 
drawn from it. For example, suppose our universe consists of the 
90 numbers in the array given in Sec. 4.4, page 67. Suppose 
we were to draw at random 10 of these 90 numbers and compute 
the average of the 10 items in our sample. There is no telling 
exactly what that average will be, since there are many, many 
different combinations of 10 numbers which could be drawn from 
this “universe” of 90 numbers. In fact, the student may be 
surprised to discover that over 7 trillion different samples of 10 
items can be selected from 90 numbers! 1 But while we cannot 
tell which 10 numbers will be included in the sample, we can at 
once tell some things about the sample. The average in the 
sample surely cannot be greater than 179.9, since if our sample 
happened to include the 10 largest items in the universe their 
total would be 1799 and their average would be 179.9. Similarly,
if our sample happened to include the 10 smallest items in
the universe, the total would be 637 and the average would be
63.7. Thus we can at least set some limits. Of the total 7
trillion possible samples, one will have an average as large as 
179.9 and one as small as 63.7. To be sure, the chances of our 
getting either of these averages are very small. It could happen 
by chance, but it would happen by chance only once in 7 trillion 
trials. If we did get such an average in a sample, it could have 
come out of a universe like this one, but it probably would not 
have. In fact, if we were to say that such a sample was not
selected by chance from this universe we would be wrong only
once in 7 trillion times, which would not be bad for a statistician!
We would be likely to conclude, if we got such a sample, either that
it came from some other universe or that, if it came from
this universe, it was carefully selected rather than having been
selected by chance.

1 Strictly speaking this is true only if none of the original 90 numbers is a
duplicate, and our “universe” does contain a few numbers which are
repeated. Despite these repetitions, the number of different samples of 10
which could be drawn even from this small universe is almost limitless.

It would be possible, given the data in our universe, to com¬ 
pute the probability of getting a sample with an arithmetic 
average of any other size. The probability of getting averages
above 179.9 or below 63.7 is zero. Such averages are impossible.
But if we wanted to find the probability of getting a sample with 
an average of, say, 100, we would have to find how many different 
combinations of 10 numbers selected from this universe would 
yield a total of 1000, and comparing this number of samples with 
the total possible 7 trillion samples we would discover the pro¬ 
portion of samples selected at random which would have means 
of 100. The inquisitive student will find immediately by experi¬ 
mentation that there are many, many samples of 10 items which 
can be drawn from this universe of 90 cases, each sample yielding 
an average of 100. 
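
The count of possible samples can be checked directly. The sketch below assumes, as the footnote cautions, that all 90 numbers are distinct, and uses only Python's standard `math.comb`. The exact count comes to roughly 5.7 trillion, somewhat below the round figure quoted in the text but of the same order of magnitude, so the argument is unaffected.

```python
import math

universe_size = 90
sample_size = 10

# Number of different samples of 10 items that can be chosen from 90
# distinct items, ignoring the order in which the items are drawn.
n_samples = math.comb(universe_size, sample_size)
print(f"{n_samples:,}")

# Chance of drawing one particular sample, such as the 10 largest items.
print(1 / n_samples)
```

Finding the probability of a sample mean of exactly 100 would require counting the combinations whose total is 1000, which depends on the actual 90 values and is therefore not attempted here.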

9.2. Standard Error of the Arithmetic Mean.—In actual prac¬ 
tice the statistician knows certain facts with regard to the sample 
which he has studied, and wishes to draw inferences as to the 
characteristics of an unknown universe from which the sample 
was selected. Let us continue for the moment, however, to con¬ 
sider the inverse problem, in which we know the facts concerning 
the universe and wish to determine what kinds of samples can be 
drawn from it, and the probabilities of drawing various ones of 
them. Although it is true that we cannot be sure that the aver¬ 
age weight of any particular group of 1000 babies is equal to the 
average weight of all babies, it can be demonstrated 1 that, if we 
took an infinite number of samples of 1000 babies each and calcu¬ 
lated the mean of each of these samples, the average of these 
means of samples would be equal to the average weight of all 
babies, and the standard deviation of the means of these samples
would be equal to

$$\sigma_{\bar{x}} = \sigma_x \sqrt{\frac{U - S}{S(U - 1)}}$$

where $\sigma_{\bar{x}}$ is the standard deviation of the means of the samples,
$\sigma_x$ is the standard deviation of the weights of all the babies in the
universe, U is the number of cases in the universe, and S is the
number of cases in the sample. Since it is usually true that U
is tremendously large when compared with S, we shall not be led
far astray in assuming that the universe is infinite in size. If we do
this we simplify the formula greatly, obtaining the following:

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{S}}$$

Or, since S is the number of cases studied, and since this is usually
represented by the letter N, we have

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$$

1 C. H. Richardson, “An Introduction to Statistical Analysis,” pp. 259-
260, Harcourt, Brace and Company, Inc., New York,
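
To see how little is lost by dropping the factor that involves U, the two forms of the formula can be compared numerically. This is a minimal sketch: the function names are ours, and the standard deviation of 6.58 and the universe sizes are illustrative values only.

```python
import math

def se_mean_finite(sigma, sample_size, universe_size):
    """Std. dev. of sample means, keeping the finite-universe factor."""
    s, u = sample_size, universe_size
    return sigma * math.sqrt((u - s) / (s * (u - 1)))

def se_mean_simple(sigma, sample_size):
    """Simplified form, treating the universe as indefinitely large."""
    return sigma / math.sqrt(sample_size)

sigma = 6.58  # illustrative standard deviation of the universe
for u in (2_000, 100_000, 10_000_000):
    print(u,
          round(se_mean_finite(sigma, 1000, u), 4),
          round(se_mean_simple(sigma, 1000), 4))
```

When the sample is half the universe the two results differ noticeably; once the universe is thousands of times larger than the sample they agree to several decimal places.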


It will be noted that the standard deviation given in the
numerator of this fraction is the standard deviation of the weights of all
babies in the universe. This is, in any actual problem, unknown.
We can discover the standard deviation of the weights of the
babies in the sample, but there is no way of knowing the facts
relative to the universe. As a matter of practice, in the absence
of the necessary data describing the universe, we do assume that
the standard deviation of the universe is equal to that of the
sample. It has been shown empirically that the error made in
assuming this is not great. It does, however, make our
conclusions approximate rather than exact.

If we can make this assumption that the standard deviation
of the weights in the universe is equal to the standard deviation of
weights in the sample, we can then make a definite statement
relative to the distribution of the means of samples. Let us take
the case of students’ heights which we have been discussing. We
found that the standard deviation of the heights in the sample was
6.58 cm. (page 143). The number of cases studied was 1000.
Now if we can assume that the standard deviation of the heights 
of all college students is 6.58 cm. (which is probably better than 
guessing at the standard deviation in the universe, but which is 
nevertheless probably not exactly accurate), we can make a state¬ 
ment about the distribution of means of samples of 1000 students. 
If we substitute in the last formula, we have

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}} = \frac{6.58}{\sqrt{1000}} = 0.208$$

Let us see what this means. If we took many, many samples
of 1000 students each, the average heights would not be the same 
from sample to sample. Some averages would be larger than 
others. But the standard deviation of these averages would be 
0.208 cm.; that is, about two-thirds of all the averages (actually 
68.27 per cent of them) would be within 0.208 cm. of the average 
height of all students in the universe. About 95 per cent of all 
the samples would have means within twice this distance, or 
within 0.416 cm. of the mean of all student heights in the uni¬ 
verse. And practically never should we get a sample whose 
mean differed from the mean in the universe by more than 3 
(0.208) = 0.624 cm. 

In our sample of 1000 students we found an average height of 
175.335 cm. (page 88). We do not know that this is the average 
height of all students; other samples would give other means. 
But on our assumptions we know that two-thirds of all these other 
means will be within 0.208 cm. of the mean of the universe. It 
is therefore true that the chances are 2 out of 3 that this mean is 
within 0.208 cm. of the mean of the universe. Conversely, it is 
true that the chances are 2 out of 3 that the mean of the universe 
is within 0.208 cm. of the mean of this sample. Since the mean of 
the sample is 175.335 cm., the chances are 2 out of 3 that the 
mean of the universe is within 0.208 cm. of 175.335 cm.; or, that 
the mean of the universe is between 175.127 and 175.543 cm. 
Likewise we can now deduce the facts that the chances are 95 out
of 100 (19 to 1) that the mean of the universe is within 0.416 of
175.335, that is, between 174.919 and 175.751 cm. Likewise
it is almost certain that the mean of the universe is within 
3(0.208) of the mean of the sample, or, that the mean of the uni¬ 
verse is between 174.711 and 175.959 cm. 1 
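
The three ranges just quoted follow from simple arithmetic on the sample mean and the standard deviation of the means. A minimal sketch, with a helper name of our own choosing:

```python
mean = 175.335   # sample mean height, cm
se = 0.208       # standard deviation of the sample means, cm

def limits(center, spread, k):
    """Return (lower, upper) limits k spreads on either side of center."""
    return (round(center - k * spread, 3), round(center + k * spread, 3))

for k, label in [(1, "chances 2 out of 3"),
                 (2, "chances 95 out of 100"),
                 (3, "almost certain")]:
    print(label, limits(mean, se, k))
```

The three pairs printed correspond to the ranges of one, two, and three times 0.208 cm on either side of 175.335 cm.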

Note, then, that we can make definite statements about the
universe if we accept the assumptions we have been forced to
make, that the standard deviation of the universe is equal to the
standard deviation of the sample. We have studied 1000 heights
and found an average of 175.335 cm. True, it may be that the
average of the universe differs somewhat from this figure. But
we are practically certain that the mean of the universe lies
between 174.711 and 175.959 cm.; and we can compute the
chances that the mean lies at any particular distance from that
found in the sample.

1 The instructor may wish to have the student read Sec. 9.6 at this point.

Let us now recall that our present line of reasoning started with 
the assumption that we knew the facts concerning the universe, 
and were trying to predict the variations which might occur in 
samples drawn from this known universe. Actually, the statistician
works the other way around. He has a known sample,
and wishes to infer from the characteristics of this sample the
facts about the universe. Since he has not investigated every
case in the universe he cannot describe the universe with complete
precision. He can, however, make some estimates on the
basis of his sample if he assumes that the standard deviation
which he found in his sample is the same as the standard deviation
in the universe. It is at this point that he introduces an
element of error. He has already used up one of the samples
which could be drawn from his universe, and to this extent has
influenced his conclusions. We shall say at a later point in our
studies that he has lost one “degree of freedom” in his work.
Consequently now that we are to work backward, from sample to
universe instead of from universe to sample, we must divide by
the square root of (N - 1) instead of by the square root of N. The
formulas which we have given up to this point in this chapter are
formulas for estimating the characteristics of samples drawn from
known universes. If we are to turn now to the much more important
problem of drawing inferences about unknown universes
from known samples we should give our formula 1 as follows:

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N - 1}}$$

This we call the formula for the standard error of the arithmetic
mean. A standard error is an estimated standard deviation.
The estimate is based on the characteristics of a single known
sample, and from this sample we estimate a standard deviation
of statistics in many samples drawn from an unknown universe.
Thus the standard error of an arithmetic mean is an estimate of
the standard deviation of the arithmetic means of all possible
samples of given size drawn from a given universe. It is, of
course, the best estimate which we can make; but it is nevertheless
only an estimate.

1 Some statisticians use a different symbol for the standard error of the
arithmetic mean where we have used the symbol $\sigma_{\bar{x}}$.

To be sure, our new formula for the standard error of the
arithmetic mean is very nearly the same as our original provisional
one. We have merely subtracted unity under the radical
in the denominator, and if N is large it will make little difference
whether we use N or (N - 1). In our problem of students’
heights, for example, if we use the correct formula we get

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N - 1}} = \frac{6.58}{\sqrt{999}} = 0.208 \text{ cm.}$$

To three decimal places this is the same answer that we got before
when we used N rather than (N - 1). But when N is small the difference
becomes important, and the student will be wise to fall into the
habit of using the formula in the precise form.
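
The effect of using (N - 1) in place of N can be seen by trying both divisors at several sample sizes. This is a sketch only: the function and its `use_n_minus_1` flag are our own naming, and the small values of N are hypothetical, chosen to show where the difference begins to matter.

```python
import math

def se_mean(sd, n, use_n_minus_1=True):
    """Standard error of the mean, dividing by sqrt(n - 1) or sqrt(n)."""
    divisor = n - 1 if use_n_minus_1 else n
    return sd / math.sqrt(divisor)

sd = 6.58  # standard deviation found in the sample of heights, cm
for n in (1000, 10, 5):  # 10 and 5 are hypothetical small samples
    print(n,
          round(se_mean(sd, n, use_n_minus_1=False), 3),
          round(se_mean(sd, n, use_n_minus_1=True), 3))
```

With 1000 cases the two results agree to three decimal places; with 5 cases they differ by more than a tenth of the value.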

The student who considers the formula for the standard error
of the arithmetic mean will note immediately that this standard
error varies directly with the standard deviation in the sample
and inversely with the square root of the number of cases (or
really this number less one). This is what one might expect on
purely a priori grounds. We would anticipate that large samples
would be more reliable and dependable than small ones; that is, that
we could draw conclusions concerning the universe from large samples
with more certainty than from small ones. We now note
that increasing the size of a sample does increase its reliability,
but not in proportion to the size of the sample. The reliability
increases with the square root of the number of cases, rather than
with the number of cases itself. If we want twice the reliability,
we must study 4 times as many cases. If we want 6 times the
reliability, we must study 36 times as many cases. If we want n
times the reliability, we must study $n^2$ times as many cases. Or,
to be more specific, we estimated a few paragraphs back that it
was practically certain that the mean height of students in the
universe from which our sample of 1000 students was drawn lay
between 174.711 and 175.959 cm. This leaves us a range of
1.248 cm. within which we are uncertain. If we want to cut this
range of uncertainty in half, we must study not 2000 but 4000
cases. If we want to cut the range of uncertainty down to half a
centimeter, we will be insisting on 1.248/0.5 or 2.50 times as much
accuracy. We should, then, study $2.50^2$ or 6.25 times as many
cases, or 6250 students.
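
The square-root rule can be turned into a small calculation of the number of cases needed for a desired range of uncertainty. This is a sketch with a function name of our own; it treats N and (N - 1) as interchangeable, as the text does for large samples.

```python
def cases_needed(current_n, current_range, target_range):
    """Cases required to shrink the range of uncertainty to target_range."""
    factor = current_range / target_range   # how many times more reliable
    return current_n * factor ** 2          # reliability grows with sqrt(N)

print(cases_needed(1000, 1.248, 0.624))   # halving the range
print(cases_needed(1000, 1.248, 0.5))     # range of half a centimeter
```

Carried through without rounding, the second case gives about 6230 rather than 6250 cases; the difference arises only because the text rounds 1.248/0.5 to 2.50 before squaring.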

Our formula tells us not only that larger samples will be more 
reliable, and that the reliability will vary with the square root of 
N, but it tells us also that our conclusions with regard to the 
universe are more reliable if the sample shows a high degree of
uniformity, that is, if the variation or dispersion within the sample is
small. This, too, we would expect. If we study the weights of
1000 newborn babies and find that they vary tremendously in 
weight, we will hesitate to draw very precise conclusions as to the 
weights which we might draw with the next 1000. If, on the con¬ 
trary, the first 1000 babies chosen at random all have weights 
which are almost identical in size, varying, let us say, over a range 
of but a tenth of an ounce, we would not be likely to suppose that 
the mean of the universe could differ by 20 lb. from the mean we 
had discovered. The two things, then, which determine our 
estimates of the accuracy or reliability of the sample we have 
studied are the size of the sample and the dispersion within it. 

When we talk of “standard error” we are not talking of errors 
in computation, nor of errors which would arise by selecting a 
sample from the wrong universe, nor of errors which might arise 
from inaccuracies in measurement. We are talking merely about 
the errors which we might make in attempting to describe an 
unknown universe on the basis of the characteristics of a sample 
selected at random from that universe. Such samples will differ 
among themselves, and only by chance will they yield statistics 
which are exactly equal to those in the universe. When we compute
a number which describes a sample (such a number, for
example, as an arithmetic mean, a median, a standard deviation,
or a coefficient of variation) we call the number a statistic. A
number which describes a universe we call a parameter. Thus
the value of 6.58 cm. which we found as the standard deviation
of the heights of the Harvard students is a statistic. But if we
were to say that the average birth weight of all newborn male
babies is 7.6 pounds, our number would be a parameter. Using
these terms we can say that one of the basic purposes of statistics
is to see what inferences we can draw about parameters from
corresponding statistics. We find a statistic in a sample, and we
try to draw inferences about the corresponding parameter of the
universe from which the sample came.

9.3. The Probable Error.—The standard error of the mean tells
us a range within which two out of three means of samples will
lie. Similarly, the standard error of any other measure tells us
the range within which two out of three similar measures will lie
in other samples. For some unknown reason many people are
interested in the range within which the chances are even that
the mean (or any other measure) will lie. Within what range
can we expect the means of half the samples to lie?

This question is easy to answer if we remember our study of 
the relationship that exists between measures of dispersion (page 
145). We discovered that the semi-interquartile range (which 
includes half the cases) and the standard deviation were related 
in a normal distribution in this way: 

$$Q = 0.6745\sigma$$

The distance that will include half of the cases is just over two- 
thirds of the standard deviation. 

If we remember this relationship we can compute from any 
standard error the distance within which the chances are even. 
We know that 0.6745 times the standard error will give us the
desired result. This value is known as the probable error, and is 
symbolized by the letters PE. Thus the probable error of the 
mean is 0.6745 times the standard error of the mean, and our 
formula would be 

$$PE_{\bar{x}} = 0.6745\,\sigma_{\bar{x}} = (0.6745)\frac{\sigma_x}{\sqrt{N - 1}}$$

If we apply this to the case of students’ heights, we get

$$PE_{\bar{x}} = 0.6745\,\sigma_{\bar{x}} = (0.6745)(0.208) = 0.14$$

We can now say that the mean height of the students in our sam¬ 
ple is 175.335 cm., and that the chances are even that the mean of 
all students’ heights in this universe is between 175.335 + 0.14
and 175.335 — 0.14; that is, the chances are even that the true 
mean lies between 175.195 and 175.475 cm. It is important to 
note that the chances are also even that the mean of the universe 
will lie outside this range. Students occasionally acquire the
idea that the probable error sets the limits of error, that it is
the same as “possible” error. This is by no means true.
Statisticians usually assume as rough limits that chance phenomena
will not vary from the mean by more than three standard deviations,
which would be 3/0.6745, or almost 4.5 probable errors.
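
The factor 0.6745 need not be taken on faith: it is the distance from the mean, in standard deviations, that encloses half of a normal distribution. The sketch below recovers it with `statistics.NormalDist` from Python's standard library; the function name `probable_error` is ours.

```python
from statistics import NormalDist

# Half of a normal distribution lies within +/- z of the mean when the
# cumulative probability at z is 0.75 (25 per cent on each side beyond).
z_half = NormalDist().inv_cdf(0.75)
print(round(z_half, 4))

def probable_error(standard_error):
    """Probable error corresponding to a given standard error."""
    return z_half * standard_error

print(round(probable_error(0.208), 2))  # probable error of the mean height
print(round(3 / z_half, 2))             # three standard errors, in PE units
```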

It is common in scientific work to state the probable error of
a mean (or of any other measure) immediately after the statement
of the mean itself but preceded by a ± sign. Take, for
example, the case just studied. We have said that the mean
is 175.335 cm. and that the probable error of the mean is 0.14
cm. Commonly this would be written

$$\bar{X} = 175.34 \pm 0.14 \text{ cm.}$$

Statisticians reading this would understand it to mean that the 
mean of the sample studied is 175.34 cm., and that the probable 
error of this mean is 0.14 cm. It is becoming increasingly com¬ 
mon for men to use the standard error rather than the probable 
error, and there are decided advantages in so doing. When the 
standard error is given, the fact should always be pointed out, 
however, because it is understood when one sees two figures 
separated by a ± sign that the second figure is a probable error. 
One might, in giving standard errors, make a statement similar 
to this: 

“The average height of students and the standard error of 
the average are 175.34 ± 0.14 cm.” 

One could, of course, invert the ± sign in giving standard 
errors to distinguish them from probable errors, thus: 

$$\bar{X} = 175.34 \mp 0.14 \text{ cm.}$$

If such a convention could be universally adopted, it would save 
much explanation. At present, however, such usage would not 
be understood. The probable error is rapidly passing out of use, 
and the student will be well advised to use the standard error in
preference. 

9.4. Other Standard Errors.—Just as we have discovered that
means of various samples differ, so is there variation in the standard
deviations of different samples. The values of the median
will not be the same for all samples; there will be variation in
the values of the quartiles and of $\alpha_3$ and $\alpha_4$. Just as we wish to
discover the amount of such variation which can be expected in
the cases of means of samples, so we wish to know the reliability
of other measures. The concept is the same as that already
explained in the case of the mean; hence it will not be necessary
to go through most of the explanation again. We shall give the
formulas for the standard errors of several of the measures which
we have discussed in earlier chapters, and in most cases give an
example of the computation.

Standard Error of the Standard Deviation: 1

$$\sigma_\sigma = \frac{\sigma_x}{\sqrt{2(N - 1)}} = 0.707107\,\sigma_{\bar{x}}$$


In the case of students’ heights we have found the standard 
deviation of heights to be 6.58 cm. with 1000 cases (page 143). 
Substituting in the formula, we get

$$\sigma_\sigma = \frac{6.58}{\sqrt{1998}} = 0.147$$

We had already computed $\sigma_{\bar{x}}$ as 0.208 (page 237). Thus we
can use the second formula above to get

$$\sigma_\sigma = 0.707107(0.208) = 0.147$$
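
Both routes to the standard error of the standard deviation can be checked numerically: the usual formula given above, and the more general moment formula described in the footnote to this section. This is a sketch with function names of our own; the moment values are those the footnote quotes for the Harvard heights.

```python
import math

def se_sd_normal(sd, n):
    """Usual formula, strictly correct only for mesokurtic distributions."""
    return sd / math.sqrt(2 * (n - 1))

def se_sd_moments(nu2, nu4, n):
    """General formula from the second and fourth moments about the mean."""
    return math.sqrt((nu4 - nu2 ** 2) / (4 * nu2 * n))

print(round(se_sd_normal(6.58, 1000), 3))
print(round(se_sd_moments(43.353, 5508.567, 1000), 3))
```

The two results differ only in the third decimal place, as one would expect for a distribution that is practically normal.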

This means that the chances are 2 out of 3 that the true standard
deviation of the universe is within 0.147 cm. of the standard
deviation of the sample, that is, between 6.58 + 0.147 and
6.58 - 0.147, or between 6.433 and 6.727 cm. It is practically
certain that the true standard deviation lies between 6.58 +
3(0.147) and 6.58 - 3(0.147), or between 6.139 cm. and 7.021
cm. We have seen that the probable error of any measure is
0.6745 times the standard error. Thus we can say that

$$PE_\sigma = 0.6745\left(\frac{\sigma_x}{\sqrt{2(N - 1)}}\right) = \frac{(0.6745)(6.58)}{\sqrt{1998}} = 0.099$$

Hence we can say the chances are even that the true standard
deviation of the universe lies between 6.58 + 0.099 and
6.58 - 0.099, or that it lies between 6.481 and 6.679 cm. The
standard deviation would usually be written in conjunction with
its probable error, thus:

$$\sigma = 6.58 \pm 0.099 \text{ cm.}$$

1 This formula for the standard error of the standard deviation is strictly
correct only in mesokurtic distributions. When $\alpha_4 < 3$ the standard error
is smaller, and when $\alpha_4 > 3$ it is larger, than that given by the formula.
Unless the value of $\alpha_4$ is close to 3, the error will be large. One can
determine the standard error of the standard deviation of any distribution,
normal or otherwise, by the formula

$$\sigma_\sigma = \sqrt{\frac{\nu_4 - \nu_2^2}{4\nu_2 N}}$$

The $\nu$'s in this formula are the higher moments about the mean as found in
the preceding chapter (pages 191ff.). If Sheppard's corrections are used,
the corrected moments should be substituted. If we substitute the values
of the moments of the heights of Harvard students from page 191, this
formula becomes

$$\sigma_\sigma = \sqrt{\frac{5508.567 - 43.353^2}{(4)(43.353)(1000)}} \approx 0.145$$

In this case, since the distribution of heights is practically normal, the
result obtained by this method is almost identical with the result obtained
by the method more commonly used.


Standard Error of the Median:

$$\sigma_{Med} = 1.25331\,\sigma_{\bar{x}}$$

We have found that the median height was 175.28 cm. (page
95). We have also found that $\sigma_{\bar{x}}$ is 0.208 cm. (page 240).
Substituting in the formula above, we have

$$\sigma_{Med} = 1.25331(0.208) = 0.261$$


The chances are 2 out of 3 that the true median lies between 
175.28 + 0.261 and 175.28 - 0.261, or between 175.019 and 
175.541 cm. It is almost certain that the median of the universe 
lies between 175.28 + 3(0.261) and 175.28 - 3(0.261); that is, 
between 174.497 and 176.063. 
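
The computation for the median can be sketched in a few lines. The constant 1.25331 is the square root of $\pi/2$, which is the factor that applies when the universe is normal; the function name is ours.

```python
import math

# For a normal universe the median is less stable than the mean by the
# factor sqrt(pi / 2), the 1.25331 of the formula above.
MEDIAN_FACTOR = math.sqrt(math.pi / 2)

def se_median(se_mean):
    """Standard error of the median from the standard error of the mean."""
    return MEDIAN_FACTOR * se_mean

median = 175.28
se_med = se_median(0.208)
print(round(se_med, 3))
print(round(median - se_med, 3), round(median + se_med, 3))
print(round(median - 3 * se_med, 3), round(median + 3 * se_med, 3))
```

Because the code carries more decimal places than the text, the limits it prints may differ from those above in the last digit.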

Standard Errors of the Alphas: 



We discovered the following values:

$\alpha_3 = -0.034$ (page 192)
$\alpha_4 = 2.926$ (page 192)

The standard errors of these values are

$$\sigma_{\alpha_3} = \sqrt{\frac{6}{N}} = \sqrt{\frac{6}{1000}} = 0.0775$$

$$\sigma_{\alpha_4} = 2\sigma_{\alpha_3} = 2(0.0775) = 0.1550$$

It will now be recalled that we used these values to determine
whether or not the distribution of students’ heights was normal.
In a normal distribution these values would have been $\alpha_3 = 0$
and $\alpha_4 = 3$. The value of $\alpha_3$ in our sample is -0.034, which
differs from normal by 0.034, or by $(0.034/0.0775)\sigma$. We have earlier
discovered how to compute the chances of such an occurrence
(page 169). We have here a deviation of $(0.034/0.0775)\sigma$, or of
$0.44\sigma$. In 50 per cent of the cases our deviations would be in the
other direction, and the tables show us (see page 514, Appendix I)
that between the mean and $0.44\sigma$ from the mean lie 17 per cent
more of the cases. Altogether, then, there are 67 per cent of
the cases with less deviation than this. Hence 33 per cent of
the cases would deviate more, or in 33 per cent of the cases the
value of $\alpha_3$ would have been either 0 or positive. In other words,
it is quite possible that the distribution from which this sample
was drawn was not really skewed. If the value of $\alpha_3$ differs from
0 by more than three times its own standard error, we should say
that such a skew could not be expected to arise by chance in a
sample drawn at random from a symmetrical universe. In other
words, when the value of $\alpha_3$ differs from 0 by more than three
times its standard error, we conclude that there is good evidence
that the universe is itself skewed. 1 When, as in the present case,
the value of $\alpha_3$ differs from 0 by less than three times its standard
error, we are not certain that the universe was itself skewed. It
is quite possible that one would by chance draw a sample with
as much skewness as the present one from an unskewed universe.
In fact we should get as much skewness as that of our present
sample in about 33 per cent of all chance samples from unskewed
universes. 1 We conclude, then, that there is no certain indication
of skewness in students’ heights in the data of our sample.

1 In Fig. 8.2, p. 201, are two skewed distributions. In the upper
distribution the value of $\alpha_3$ is +0.307 and its standard error is 0.0142.
The fact that $\alpha_3$ is 21 times its standard error leads us to believe that the
wage distribution from which this sample was drawn was almost certainly
skewed positively. In the lower distribution the value of $\alpha_3$ is -0.518 and
its standard error is 0.0438. Since the value of $\alpha_3$ is over 11 times the
value of its standard error, we conclude that the egg-production data from
which the sample was drawn were almost certainly negatively skewed.

Similarly we find that the value of $\alpha_4$ is 2.926, although the
value for a normal distribution is 3.0. The value in our sample
differs from the normal by 2.926 - 3 = -0.074. The standard
error itself has a value of 0.1550. Thus the difference is

$$\frac{-0.074}{0.1550}\,\sigma = -0.48\sigma$$

By pure chance we should get an absolute difference less than this
in 50 + 18.4 per cent = 68.4 per cent of the cases. This means
that we should get samples with values of $\alpha_4$ as small as this, or
smaller, from distributions in which there was actually no kurtosis,
in 31.6 per cent of the cases merely through the operation of
chance. Since this is true, we see that the value of $\alpha_4$ may well
be as high as 3 in the universe even though it is but 2.926 in the
sample; that is, there is no evidence that kurtosis exists in the
universe unless the value of $\alpha_4$ differs from 3 by an amount which
is more than three times the standard error of $\alpha_4$. Here the
value of $\alpha_4$ differs from 3 by an amount equal to but 0.48 times
the standard error of $\alpha_4$.
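
The two tests just described can be retraced in a few lines. This is only a sketch: the helper `normal_cdf` is our own stand-in for the table in Appendix I, and the formulas $\sqrt{6/N}$ and $2\sqrt{6/N}$ are the ones implied by the figures 0.0775 and 0.1550 for N = 1000.

```python
import math

def normal_cdf(z):
    """Area under the normal curve to the left of z (replaces the printed table)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 1000
se_alpha3 = math.sqrt(6 / n)   # standard error of alpha_3
se_alpha4 = 2 * se_alpha3      # standard error of alpha_4

z3 = (-0.034 - 0.0) / se_alpha3   # sample alpha_3 against the normal value 0
z4 = (2.926 - 3.0) / se_alpha4    # sample alpha_4 against the normal value 3

# Share of chance samples deviating at least this far in the same direction.
tail3 = 1.0 - normal_cdf(abs(z3))
tail4 = 1.0 - normal_cdf(abs(z4))
print(round(z3, 2), round(tail3, 3))
print(round(z4, 2), round(tail4, 3))
```

Both ratios fall far short of 3 in absolute size, so neither skewness nor kurtosis is established for the universe.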

Standard Error of a Relative Frequency (Percentage):

$$\sigma_{\%} = \sqrt{\frac{pq}{N}}$$

Dr. Charles V. Chapin of Providence, Rhode Island, states 2
that 25.45 per cent of 53,280 people who were exposed to diphtheria
between 1889 and 1915 caught the disease. What is the
standard error of this figure, 25.45 per cent? We find it from
the formula. We know that p is the probability that the event
will happen and q is the probability that it will fail to happen
(see page 155). Here our sample shows that p = 0.2545 and
q must equal 0.7455. (Since 25 per cent of the exposed persons
were afflicted, we know that 25/100 of them were afflicted, or
0.25 of the total number. Thus p = 0.25, or, to be exact,
0.2545.) The number of cases studied is given as 53,280.
Substituting in the formula, we have

$$\sigma_{\%} = \sqrt{\frac{(0.2545)(0.7455)}{53{,}280}} = 0.0019$$

1 See Table 9.1, p. 258.

2 Quoted in Whipple, “Vital Statistics,” p. 376, John Wiley & Sons, Inc.,
New York, 1933.

Thus 0.2545 did fall ill, and the standard error is 0.0019. Putting 
both figures back in percentage terms by multiplying by 100, we 
say that the attack rate was 25.45 per cent, with a standard 
error of 0.19 per cent. Practically never should we expect to 
get an attack rate higher than 25.45 per cent + 3(0.19 per cent), 
or 26.02 per cent. Practically never should we expect to get an 
attack rate lower than 24.88; that is, if we continued to take 
samples from the same universe (samples of people of the same 
age getting the same kind of medical care, leading the same 
kinds of lives, etc.) we should expect always to find that between 
24.88 per cent and 26.02 per cent of the people exposed would 
come down with diphtheria. 
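
The diphtheria example can be reproduced directly. The sketch below is ours (the name `se_proportion` is illustrative); it carries the unrounded standard error through to the three-standard-error limits.

```python
import math

def se_proportion(p, n):
    """Standard error of a relative frequency: sqrt(p * q / N), with q = 1 - p."""
    return math.sqrt(p * (1.0 - p) / n)

p, n = 0.2545, 53280
se = se_proportion(p, n)
low, high = p - 3 * se, p + 3 * se

# Express the results in percentage terms, as the text does.
print(round(100 * se, 2), round(100 * low, 2), round(100 * high, 2))
```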

Standard Error of the Semi-interquartile Range:

$$\sigma_Q = 0.7867\,\sigma_{\bar{X}}$$

We have found that the standard error of the mean is 0.208 cm.
in the case of heights of students. The semi-interquartile range
is 4.44 cm. (page 129). We now see that the standard error of
this semi-interquartile range is

$$\sigma_Q = 0.7867(0.208) = 0.164$$

Standard Error of the Average Deviation:

$$\sigma_{AD} = 0.605\,\sigma_{\bar{X}}$$

We discovered in our illustrative example that the average
deviation of the students’ heights in our sample is 5.28 cm. (page
134). We have already seen that the standard error of the mean
is equal to 0.208 cm. (page 237). Substituting in the formula,
we find

$$\sigma_{AD} = 0.605(0.208) = 0.126 \text{ cm.}$$

Standard Error of Either Quartile:

$$\sigma_{Q_1} = \sigma_{Q_3} = 1.36263\,\sigma_{\bar{X}}$$

This means for our problem:

$$\sigma_{Q_1} = \sigma_{Q_3} = (1.36263)(0.208) = 0.284$$
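
Several of the standard errors above are fixed multiples of the standard error of the mean, so they can be tabulated together. The sketch below is ours; it starts from the rounded figure 0.208, so the last decimal place may differ slightly from a result carried with more digits.

```python
SE_MEAN = 0.208  # standard error of the mean height, in cm.

# Multipliers quoted in the text for a normal distribution.
FACTORS = {
    "median": 1.25331,
    "semi-interquartile range": 0.7867,
    "average deviation": 0.605,
    "either quartile": 1.36263,
}

standard_errors = {name: k * SE_MEAN for name, k in FACTORS.items()}
for name, se in standard_errors.items():
    print(f"{name}: {se:.3f} cm.")
```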

Standard Error of $\beta_2$.—We have seen on page 212 that $\beta_2$ is
identical with $\alpha_4$. Hence the formula given on page 245 for the
standard error of $\alpha_4$ will apply also to $\beta_2$.

Standard Error of Measures of Skewness. —On page 205 we used 
as a measure of skewness the value 


$$Sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

The standard error of this value is

$$\sigma_{Sk} = \frac{1.225}{\sqrt{N}}$$

This formula does not, of course, hold for other measures of
skewness. If we apply this formula to our illustrative case, the
value of skewness is found to be 0.0179 (page 205). It is based
on 1000 cases. Hence the standard error is

$$\sigma_{Sk} = \frac{1.225}{\sqrt{1000}} = 0.0387$$


Unless this measure of skewness differs from 0 by an amount 
greater than three times its standard error, we must say that we 
are not justified in assuming that skewness existed in the uni¬ 
verse from which our sample was drawn. 

This same formula for the standard error of the measure of 
skewness can be applied when skewness is measured on the basis 
of the difference between the mean and the mode, according to 
the formula on page 201. 
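
As a check on the arithmetic, the test of skewness can be sketched as follows. The variable names are ours; the rule of three standard errors is the one stated in the text.

```python
import math

def se_skewness(n):
    """Standard error of the skewness measure discussed above: 1.225 / sqrt(N)."""
    return 1.225 / math.sqrt(n)

skewness, n = 0.0179, 1000
se_sk = se_skewness(n)
ratio = abs(skewness) / se_sk

# Skewness is taken as established only when the ratio exceeds 3.
print(round(se_sk, 4), round(ratio, 2), ratio > 3)
```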

Standard Error of the Coefficient of Variation:

$$\sigma_V = \frac{V}{\sqrt{2N}}\sqrt{1 + 2\left(\frac{V}{100}\right)^2}$$

We discovered that the coefficient of variation of the students'
heights in the sample is 3.75 per cent (page 152). The standard
error of this figure would, then, be

$$\sigma_V = \frac{3.75}{\sqrt{2000}}\sqrt{1 + 2\left(\frac{3.75}{100}\right)^2} = 0.084\left(\sqrt{1.0028}\right) = 0.084$$

If the coefficient of variation is less than 10 per cent, we can
approximate its standard error closely enough by the formula

$$\sigma_V = \frac{V}{\sqrt{2N}}$$
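
How little the correction term matters when V is small can be seen by computing both forms. This is a sketch under the formulas as reconstructed from the worked example; the function names are ours.

```python
import math

def se_cv(v, n):
    """Standard error of a coefficient of variation v (in per cent)."""
    return v / math.sqrt(2 * n) * math.sqrt(1 + 2 * (v / 100) ** 2)

def se_cv_approx(v, n):
    """Short form, adequate when v is below about 10 per cent."""
    return v / math.sqrt(2 * n)

v, n = 3.75, 1000
exact = se_cv(v, n)
approx = se_cv_approx(v, n)
print(round(exact, 3), round(approx, 3))
```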


Standard Error of the Difference between Two Measures .—In 
statistical work we are very often interested in differences and 
in their significance. Suppose, for example, that we wish to
discover whether or not there is a significant difference between the
number of spears borne on male and on female asparagus plants.
Haber tells us 1 that the average number of spears on the male
plants which he studied was 15.37 and the average number of
spears per female plant was 9.39. We know that if he had taken
another sample of male plants the average number of spears
might have differed somewhat from 15.37, and in another sample
of female plants the average might have differed from 9.39. We
are interested in the difference between the two means, 15.37 and
9.39. The difference is 15.37 - 9.39 = 5.98 spears. With other
samples yielding other means, could it be expected that there
would continue to be a difference of this kind? Could we expect
that the means would still show the male plants bearing more
spears on the average than the females? We discover the answer
to this question by computing the standard error of the difference
by means of the formula

$$\sigma_{\text{Diff.}} = \sqrt{\sigma_1^2 + \sigma_2^2}$$


in which $\sigma_{\text{Diff.}}$ represents the standard error of the difference
between two measures, $\sigma_1$ represents the standard error of the
first measure, and $\sigma_2$ the standard error of the second measure. 2

In our case this means that we must have the standard errors
of the two measures whose difference we are studying. These
are given by Haber as 0.88 for males and 0.05 for females. (Note
that these are not the standard deviations of the numbers of
spears, but the standard errors of the two means found by the
methods described on page 236.) Now that we know the two
means and their standard errors, we can proceed to discover the
standard error of the difference between the means.

                    Average Number     Standard Error
                      of Spears         of $\bar{X}$
Male plants             15.37               0.88
Female plants            9.39               0.05

The difference itself is 15.37 - 9.39 = 5.98. Its standard
error is

$$\sigma_{\text{Diff.}} = \sqrt{0.88^2 + 0.05^2} = \sqrt{0.7744 + 0.0025} = \sqrt{0.7769} = 0.881$$

The difference is 5.98, and its standard error is 0.881.

It will be noted that the difference is equal to 6.8 times its
standard error. We should almost never get, by pure chance, a
difference equal to more than three times its standard error (only
about three times in a thousand). 3 A difference 6.8 times its
standard error would not arise once in a billion times. In other
words, it is inconceivable that this difference between the numbers
of spears on male and on female plants should arise by chance
from data whose averages are really the same. It must be true
that in the entire universe of male asparagus plants the average
number of spears per plant is greater than is the average in the
entire universe of female asparagus plants.

If, then, we find a difference between two measures which is
greater than three times the standard error of the difference, we
say that such a difference would not be expected to arise from
pure chance. It must have arisen because the samples were
drawn from two universes whose means were different.

1 E. S. Haber, Journal of Agricultural Research, Vol. 45, July, 1932, p. 103.

2 This formula is accurate in the form given here only if the two measures
whose difference is being studied are uncorrelated. If correlation exists
between them the formula becomes

$$\sigma_{\text{Diff.}} = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r_{12}\sigma_1\sigma_2}$$

When there is no correlation, this formula reduces to the one above. The
nature of correlation is discussed in Chap. XV; the formula given in this
footnote can be understood after that chapter has been mastered. In
the meantime, this is one of the cases in which the simpler formula may
correctly be used; we shall later see why.

3 See pp. 253ff.
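
The asparagus example can be retraced with a small sketch. The function below is ours; its optional argument `r` stands for the correlation between the two measures and defaults to zero, which gives the simpler formula for uncorrelated measures.

```python
import math

def se_difference(se1, se2, r=0.0):
    """Standard error of the difference between two measures.

    With r = 0 this is sqrt(se1^2 + se2^2); a nonzero r applies the
    correlated form, which subtracts 2 * r * se1 * se2 under the radical.
    """
    return math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)

mean_male, mean_female = 15.37, 9.39
se_diff = se_difference(0.88, 0.05)
ratio = (mean_male - mean_female) / se_diff

# A ratio above 3 is treated as significant under the rule in the text.
print(round(se_diff, 3), round(ratio, 1), ratio > 3)
```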

Just as we have found the standard error of the difference 
between two means, we can find the standard error of the differ¬ 
ence between any other two measures whose standard errors 
are themselves known. If we have the two measures and their 
standard errors, we can compute the standard error of their 
difference by the formula given on page 250. We shall show one 
more example. A group of 1150 Wellesley freshmen had an 
average height of 64.13 in., with a standard deviation in heights 
of 2.24 in. A group of 1017 Hollins College freshmen had an 
average height of 63.86 in. with a standard deviation in heights 
of 2.09 in. 1 From these data, by the formula given on page 244, 
we can compute the standard errors of the two standard devia¬ 
tions. They are 




$$\sigma_{\sigma_1} = \frac{2.24}{\sqrt{2298}} = 0.0467$$

$$\sigma_{\sigma_2} = \frac{2.09}{\sqrt{2032}} = 0.0464$$

The difference between the two standard deviations is
2.24 - 2.09 = 0.15 in.

The standard error of the difference is

$$\sigma_{\text{Diff.}} = \sqrt{0.0467^2 + 0.0464^2} = \sqrt{0.00218 + 0.00215} = \sqrt{0.00433} = 0.0658$$


1 G. L. Palmer, Journal of the American Statistical Association, Vol. 24,
March, 1929, p. 42.

We now compare the difference with its standard error. The
difference is 0.15 in. Its standard error is 0.0658 in. The difference
between these two standard deviations is 2.28 times its
standard error. If the universe from which the Wellesley girls
and the universe from which the Hollins girls were drawn really
had the same variability (that is, if the standard deviations of
the two universes were the same), then half the time when we
drew samples we should find the standard deviation of the Hollins
sample larger than that of the Wellesley sample. Moreover,
reference to Appendix I, page 514, reveals that another 48.9
per cent of the samples would have Wellesley standard deviations
above but within $2.28\sigma$ of the Hollins standard deviation.
Hence in 98.9 per cent of the cases differences would be found 
smaller than this. By pure chance the Wellesley standard 
deviation would exceed the Hollins standard deviation by this 
amount or more in 1.1 per cent of the cases, or 11 cases out of 
1000. The statistician would say that 11 cases in 1000 is a 
significant proportion, and that it is not certain that the Wellesley 
girls were really more variable in height than the Hollins girls. 
It is quite possible that this case is one of the 11 cases in which 
such a difference would arise by chance. We could tell only by 
taking more cases and seeing whether the difference persisted. 
Had the difference between the standard deviations exceeded 
three times its standard error, we should have said it was evident 
that the Wellesley girls came from a different universe than did 
the Hollins girls—from a universe in which the standard devia¬ 
tion in heights was certainly larger than in the Hollins universe. 
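
The Wellesley and Hollins comparison can be followed step by step in code. This sketch is ours: `normal_cdf` stands in for the table in Appendix I, and the standard error of a standard deviation is taken as $\sigma/\sqrt{2(N-1)}$, as the figures 2298 and 2032 imply.

```python
import math

def normal_cdf(z):
    """Area under the normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def se_sd(sd, n):
    """Standard error of a standard deviation: sd / sqrt(2(N - 1))."""
    return sd / math.sqrt(2 * (n - 1))

se_wellesley = se_sd(2.24, 1150)
se_hollins = se_sd(2.09, 1017)
se_diff = math.sqrt(se_wellesley ** 2 + se_hollins ** 2)
ratio = (2.24 - 2.09) / se_diff

# Chance of the Wellesley figure exceeding the Hollins figure by this much or more.
tail = 1.0 - normal_cdf(ratio)
print(round(se_diff, 4), round(ratio, 2), round(tail, 3))
```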

Standard Error of the Sum of Two Measures .—If two measures 
(such as two averages or two standard deviations) of uncorrelated 
data are added, the standard error of their sum is given by the 
formula 

$$\sigma_{\text{Sum}} = \sqrt{\sigma_1^2 + \sigma_2^2}$$


This is seen to be the same as the formula for the standard error 
of the difference when the measures are for uncorrelated data. 

If the data are correlated, the formula for the standard error is 
the same as that given in the footnote on page 250, except that 
the minus sign under the radical is changed to a plus sign. 

9.5. The Significance of Differences.—There is no difference
so large that it could not occur by chance in two samples drawn
from the same universe. Conceivably two such samples might
differ by any amount in their means, in their standard deviations,
or in any other way. Yet some happenings are so unlikely that
their occurrence can hardly be looked on as a chance phenomenon.
If someone throws two dice 15 times and gets a total of
7 spots on each throw, one wonders if chance is the only force
that is operating. It is possible that an honest man should
throw one hundred 7's in succession with honest dice, but it is so
unlikely that most opponents would decide long before the 100th
throw that either the thrower or the dice were dishonest. The
fact that an event can happen by chance does not mean that we 
are willing to ascribe such a happening to chance when it occurs: 
If its happening as a result of chance is extremely unlikely, we 
usually decide that some factor other than chance has played a 
part. 

This is true not only with dice but with all events in which 
chance operates. We realize, for example, that even if there 
were no difference between the body weight of male and female 
rats, were we to take a sample of males and a sample of females 
and compute the mean weight for each sample the two means 
would probably differ somewhat. And actually there is no 
limit to the amount of the difference that might arise from chance. 
But some differences would arise so seldom by chance that, if
they arose in our samples, we should be led to believe that some 
factor other than chance was responsible. We should decide 
that the rats had been drawn from different universes—that the 
universe of male rats differed significantly in this respect from 
the universe of female rats. Hatai weighed 45 male rats and 
37 female rats and found the following: 1 



                               Male Rats    Female Rats
Number of cases                   45            37
Mean weight (grams)              214.9         167.3
$\sigma$ of weights (grams)       52.89         20.47


From these data we compute the following standard errors
of the means by the formula given on page 237:

    Males ......... 7.98
    Females ....... 3.41

1 Quoted in H. H. Donaldson, The Rat, Wistar Institute of Anatomy and
Biology Memoir No. 6, Philadelphia, 1924, p. 50.

The difference between the two means is 214.9 - 167.3 = 47.6
grams. The standard error of the difference, computed according
to the formula on page 250, is 8.68. Since the difference is 47.6
and its standard error is 8.68, the difference is 5.49 times its
standard error. Could such a difference arise by chance? Yes,
any difference could arise by chance. But if the male and female
rat universes were the same, in half the cases the mean of the
male sample would be less than the mean of the female sample.
And the table in Appendix I, page 514, tells us that between
the mean and $5\sigma$ from the mean there are included another
49.99997133 per cent of the cases. Thus even if we go up only
to $5\sigma$ (and this problem goes beyond this to $5.49\sigma$), we have
included 99.99997133 per cent of the cases. We should get a 
difference over five times its standard error from pure chance 
but once in over 1,700,000 times. To believe that this difference 
between mean weights of male and female rats arose from chance 
is somewhat like believing that an honest man can throw honest 
dice and get 8 successive 7’s. It could happen, but would 
happen so seldom that we are prone to ascribe its occurrence to 
factors other than chance. Thus here we should say that, 
although male and female rats could yield such different means 
by chance, we are forced to decide that such a difference did 
arise from some other source, namely, from the fact that the
universes were different. 
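
The rat-weight comparison can be sketched as below. The formula of page 237 is not reproduced in this section, so the sketch assumes the standard error of a mean to be $\sigma/\sqrt{N-1}$, which comes close to the figures quoted; the names are ours.

```python
import math

def se_mean(sd, n):
    """Standard error of a mean, assumed here to be sd / sqrt(N - 1)."""
    return sd / math.sqrt(n - 1)

se_male = se_mean(52.89, 45)
se_female = se_mean(20.47, 37)
se_diff = math.sqrt(se_male ** 2 + se_female ** 2)
ratio = (214.9 - 167.3) / se_diff
print(round(se_male, 2), round(se_female, 2), round(se_diff, 2), round(ratio, 2))

# The dice comparison: a 7 has probability 1/6 on each throw of two dice,
# so eight successive 7's occur by chance once in 6**8 trials.
eight_sevens = 6 ** 8
print(eight_sevens)
```

The figure for eight successive 7's is of the same order as the 1,700,000 quoted for a difference of five standard errors.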

How much of a difference shall we allow to exist before we 
say that chance did not account for it? This is like asking 
how many times you will allow a man to throw 7’s with dice 
before you will look for non-chance explanations. Men differ 
in their credulity. In a gambling game their credulity depends 
somewhat, perhaps, on whether they have something at stake. 
Some men would be suspicious of dishonesty in the throwing of 
5 successive 7’s; others would look upon the affair as unusual 
but still due to chance. The same is true in statistical work. 
People differ in their credulity. It is possible even that some 
people might believe that a difference such as the one we have 
just discovered between rats’ weights arose by chance. If some¬ 
one tosses a penny and it falls with “heads” uppermost, no one 
would rule out the possibility that it happened by chance, even 
though there was an even chance that it would come up “tails.” 
Most people would not rule out chance if something happened 
against which the odds were 2 to 1. Statisticians, like other
people, differ in credulity, but they have adopted certain
completely arbitrary limits beyond which they assume that chance
does not operate. It has long been customary in the United
States to set these limits at three standard deviations or three
standard errors, and to say that any difference is “significant”
if it exceeds three times its standard error. If chance alone were
operating, we should get differences smaller than this 99.74 per
cent of the time and larger differences only 26 times out of 10,000.
The chances against such an occurrence on the basis of chance
alone are almost 400 to 1.

In the past decade or two many statisticians, especially in
England, have adopted as limits of significance what they call
the “5 per cent point” and the “1 per cent point.” These are
the limits which would be exceeded by chance in but 5 per cent or
1 per cent of the cases. They correspond to values which differ
from the mean by 1.96 standard deviations and 2.58 standard
deviations, respectively. When the ratio of the difference to its
standard error lies below 1.96 they would call it non-significant,
when it lies between 1.96 and 2.58 they would call it significant,
and when it lies above 2.58 they would call it highly significant. 1
The older custom in the United States has been to be more
conservative, and to call a difference significant only when it is
at least three times as large as its standard error. The student
should understand that any such limits are arbitrary, and that no
one has the right to insist that one of them is “right” and another
is “wrong.” The statistician who relies on the “5 per cent point”
and says that a given difference did not arise by chance will be
wrong 1 time in 20; the statistician who makes the same statement
based on the “1 per cent point” will be wrong 1 time in 100;
and the statistician who uses the criterion of three standard
errors will be wrong only three times in a thousand. Even large
differences can arise by chance, but when they get large enough
the likelihood that they should arise by chance alone becomes so
small that the statistician feels justified in neglecting it and in
assuming that the difference is significant of the operation of
forces other than chance. Likewise if we say that the value of
$\alpha_3$ is “significantly” different from zero we mean that, although
such a value could have arisen in a sample drawn from a universe
in which the value of $\alpha_3$ was zero, nevertheless the likelihood of
such an occurrence is so small that we neglect it and assume that
some forces other than chance operated. Whenever the difference
between the value of $\alpha_3$ and zero is greater than three times
the standard error of $\alpha_3$, we come to this conclusion that the
difference is significant. One might have chosen the arbitrary
point of 2.98 times the standard error or of 3.7 times the standard
error. Actually it is more convenient to take a whole number of
standard errors and base our test of significance on it.

1 Actually, the values 1.96 and 2.58 are correct only when the samples
include large numbers of cases. With small samples it is necessary to
increase these limits somewhat to include 95 per cent and 99 per cent of the
cases. In such cases the statistician uses tables of t rather than tables of
standard errors.
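
The three conventions can be compared by computing how often each limit would be exceeded by chance alone. The sketch below is ours and applies only to large samples, where the normal curve is appropriate; the function names are illustrative.

```python
import math

def two_sided_tail(z):
    """Share of chance differences exceeding z standard errors in either direction."""
    return math.erfc(z / math.sqrt(2.0))

def classify(ratio):
    """Label a ratio of difference to standard error by the 5 and 1 per cent points."""
    size = abs(ratio)
    if size < 1.96:
        return "non-significant"
    if size < 2.58:
        return "significant"
    return "highly significant"

for limit in (1.96, 2.58, 3.0):
    print(limit, round(two_sided_tail(limit), 4))

# The Wellesley and Hollins ratio of 2.28 under the two conventions.
print(classify(2.28), 2.28 > 3.0)
```

Under the 5 per cent point the ratio of 2.28 would be called significant, whereas under the older rule of three standard errors it would not.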

The reader should remember that, when a difference is less
than three times its standard error, there is no guarantee that
it did arise from chance. On page 252 we found that the differ¬
ence between two standard deviations was equal to but 2.28 
times its standard error. The odds are better than 40 to 1 
against the occurrence of a difference as great as this. Yet in 
spite of the fact that there may well have been a difference 
between the standard deviations of the two universes from which 
these samples were drawn, some statisticians do not feel sure 
of it when the odds against it are but 40 to 1. Odds of 40 to 1 
are not certainty to every statistician. He might say in this 
case that the proper thing to do would be to measure more 
Wellesley and Hollins freshmen to see whether the difference in 
standard deviations continued to exist. If it did not, then it 
would signify that the difference had arisen from chance; if it 
did continue to persist as more and more cases were added (that 
is, as N increased in each sample), the standard errors of the 
standard deviations would fall, the standard error of the differ¬ 
ence would fall, and ultimately the difference would be three 
times its standard error. Then the statistician would decide 
that he had gone far enough, and that he would be safe in ascrib¬ 
ing the difference to forces other than chance. 

In this connection it may be helpful for the student to see 
just what the chances are of getting a difference by chance which 
is various numbers of times its standard error. The chances 
can be easily computed from tables such as that on page 514
of Appendix I. In Table 9.1 the first column is the difference
divided by its standard error, and in the other column are the
approximate chances against the occurrence of such a difference
by pure chance.

Perhaps it is as well to point out here that a large difference,
an important difference, and a significant difference are not at all
the same concept. When we say that a difference is large, we
are referring to the actual size of the remainder left after
subtraction. Thus a difference of 5 lb. is 16 times as large as a
difference of 5 oz. Yet the 5-lb. difference may not be significant
because its standard error is large, while the 5-oz. difference may
be significant because its standard error is small. When we
say that a difference is significant, we mean that we are convinced
it did not arise by chance, but reflects a difference which
actually exists in the universes from which the samples were
drawn. And finally, a difference might be both large and
significant, yet unimportant. Whether a difference is important
or not depends on what it can contribute toward an explanation
of the problem being studied. A difference which does not concern
the statistician in any way, which raises no problems for
him and does not help him to explain the phenomena that he is
considering, is unimportant. The student must learn that
statistical manipulations never make anything important, that
they are a means and not an end. To be sure, if a difference is
important to us we are likely to wish to test its significance, and
if it is large it is more likely to be significant than if it is small.
The three ideas are related, but they are by no means the same.

Table 9.1.—Chances against the Occurrence of a Deviation Farther
from the Mean than the Distance Stated

    Difference /
    $\sigma_{\text{Diff.}}$        Chances
      0.6745               1 to 1
      1.0               2.15 to 1
      1.5               6.48 to 1
      2.0               21.0 to 1
      2.5               79.5 to 1
      3.0                369 to 1
      3.5              2,150 to 1
      4.0             15,800 to 1
      4.5            147,000 to 1
      5.0          1,740,000 to 1
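
The entries of Table 9.1 can be recomputed from the normal curve. In this sketch (the function name is ours) the chance p of a deviation farther than z standard errors in either direction is converted to odds of (1 - p) to p against it.

```python
import math

def odds_against(z):
    """Odds against a chance deviation farther than z standard errors
    from the mean, in either direction."""
    p = math.erfc(z / math.sqrt(2.0))  # two-sided tail area
    return (1.0 - p) / p

for z in (0.6745, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0):
    print(f"{z:<7} {odds_against(z):>12,.2f} to 1")
```

The printed odds agree with the table once they are rounded to about three significant figures.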

9.6. Fiducial Probability and the Confidence Interval.—The 
use of terms in the field of probability is by no means standard¬ 
ized. Authorities disagree even as to the definition of the term 
“probability” itself, finding it one of the hardest of concepts
to define without circularity. The student should be warned,
therefore, that the followers of some schools of thought would 
object to certain of the expressions used in this chapter, prefer¬ 
ring to state the conclusions of reliability analysis in other terms. 

For example, in Sec. 9.2, page 240, we found that the arithmetic
mean height of a group of students was 175.335 cm. and that the
standard error of this mean was 0.208 cm. We interpreted this
by saying that, although we could not know for certain the exact
size of the average height in the universe, nevertheless the chances
were 2 out of 3 that this mean height lay in the interval from
175.127 to 175.543 cm.; the chances were 95.5 out of 100 that
it lay in the interval from 174.919 to 175.751 cm.; and the chances
were 99.7 out of 100 that it lay in the interval from 174.711 to
175.959 cm. We could summarize those conclusions as follows:

    Interval of             Probability That Mean
    Height (cm.)            of Universe Lies in
                            This Interval
    175.127-175.543              0.6827
    174.919-175.751              0.9545
    174.711-175.959              0.9973

Now as we have just said, some authorities would say that 
these statements are incorrect. If you were to ask them, “ What 
is the probability that the arithmetic mean in the universe lies 
between 175.127 and 175.543 cm.?” they would answer that it 
is not a question of probability at all, but a matter of fact. 
Either the mean of the universe does lie in this range, in which 
case the probability is 1, or it does not lie in this range, in which
case the probability is 0. They would argue that there is no 
probability of 0.6827, but a probability which is either zero or 
unity—or that it is not really a case of probability in the strict 
sense at all. 

We could argue, of course, that the same stand can be taken
on any probability problem. What is the probability that if I
toss a coin it will come up “heads”? I can argue that if I do
toss a coin it will either come up heads (in which case the
probability is 1), or it will come up tails (in which case the probability
is 0). So I could maintain that there is no probability of 1/2
in this case, but a probability of either 0 or 1. Yet we do know
that the idea of a probability of 1/2 is useful in the case of the
coin, and that it describes something which is relevant to the
problem. While it is true that on any particular toss the coin
will fall either heads or tails, it is also true that if I toss it over 
and over again, many, many times, I will be right approximately 
half the time if I predict each time that it will come up heads. 

In the case of the students’ height, if I say that the mean of the 
universe lies in the interval from 175.127 to 175.543 cm., I shall 
be either right or wrong. But if, on many statistical problems, 
I draw similar inferences, all based on this same sort of reasoning, 
I shall be right approximately 68.27 per cent of the time. We
notice, then, that the probability about which we are talking 
is not the probability that the arithmetic mean of the universe 
has some given size in a particular problem, but the probability 
that our statements about statistical results are correct. Actu¬ 
ally, this is the same thing that we are doing if we say that the 
chances are 0.5 that I will get a head on the toss of a coin. If 
I make many such statements I shall be right about half the time. 
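
The claim that such statements are right a fixed share of the time can be illustrated by simulation. Everything in the sketch below is an assumption made for illustration: the universe is taken to be normal with a mean of 175 cm. and a standard deviation of 6.5 cm., each sample has 50 cases, and the standard error of the mean is computed as the sample standard deviation divided by $\sqrt{N-1}$.

```python
import math
import random
import statistics

def coverage(k, trials=2000, n=50, mu=175.0, sigma=6.5, seed=0):
    """Share of samples whose interval (mean +/- k standard errors)
    contains the true mean mu of a hypothetical normal universe."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.pstdev(sample) / math.sqrt(n - 1)
        if abs(mean - mu) <= k * se:
            hits += 1
    return hits / trials

share_one_se = coverage(1.0)
share_three_se = coverage(3.0)
print(share_one_se, share_three_se)
```

With samples of this size the shares should fall near, though slightly below, the 68.27 and 99.73 per cent of the normal curve, since the standard error is itself estimated from each sample.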

This sort of probability, which refers really to the likelihood 
that statements about statistical results are correct, is called 
fiducial probability, and where we have said throughout this
chapter that the chances are 0.6827 that the mean of the universe
lies within the interval from 175.127 to 175.543 cm., some
statisticians would prefer to say that the fiducial probability of the
confidence interval 175.127 to 175.543 cm. is 0.6827.

We have used three different confidence intervals, one covering
two standard errors, one covering four standard errors, and
the other covering six standard errors. The wider the confidence
interval, the larger the proportion of our statements which
will be correct. If we take a confidence interval two standard
errors wide, and say that the arithmetic mean height in the uni¬ 
verse lies between 175.127 cm. and 175.543 cm., we will be either 
right or wrong, but in many such cases we shall be right 68.27 
per cent of the time. If we make many such statements as that 
the arithmetic mean height in the universe lies between 174.919 
cm. and 175.751 cm., two standard errors from the mean, we shall 
be right 95.45 per cent of the time. If we set limits three stand¬ 
ard errors from the mean, and make many such statements as 
the statement that the arithmetic mean of the universe lies 
between 174.711 cm. and 175.959 cm., we shall be right 99.73 
per cent of the time. Or if we adopt the 5 per cent point, taking 
points 1.96 standard errors from the mean, and make many statements
such as, “the arithmetic mean height in the universe lies
between 174.927 cm. and 175.743 cm.,” we shall be right 95 per 
cent of the time. And if we adopt the 1 per cent point, taking 
points 2.58 standard errors from the mean, and make many state¬ 
ments such as, “the arithmetic mean height in the universe lies 
between 174.798 cm. and 175.872 cm.,” we shall be right 99 per 
cent of the time and wrong but 1 per cent of the time. These last 
two intervals are often called the 0.95 fiducial interval and the 0.99
fiducial interval, or alternatively the 0.95 and the 0.99 confidence
intervals. The wider our confidence interval, the more confidence
we have in our results. On any particular single problem our
statement that the mean lies within a given interval is either correct
or incorrect; but if we draw many, many statistical inferences
by the methods here described, we can know in advance
about how often we shall be right and how often wrong. 
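The intervals and percentages quoted above can be reproduced in a few lines. This is a sketch, not the author's computation: the sample mean 175.335 cm. and standard error 0.208 cm. are inferred here from the midpoint and half-width of the quoted intervals, and the function names are invented for the illustration. The share of correct statements is taken from the normal curve.

```python
import math

MEAN = 175.335  # assumed: midpoint of the intervals quoted in the text
SE = 0.208      # assumed: half-width of the one-standard-error interval

def interval(k):
    """Limits lying k standard errors on either side of the sample mean."""
    return (round(MEAN - k * SE, 3), round(MEAN + k * SE, 3))

def share_correct(k):
    """Area of the normal curve within k standard errors of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.96, 2.58):
    low, high = interval(k)
    print(f"{k:>4} s.e.: {low:.3f} to {high:.3f} cm., "
          f"right {100 * share_correct(k):.2f} per cent of the time")
```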

9.7. Suggestions for Further Reading.—A number of interesting examples 
of the application of measures of reliability are found in William V. Lovitt 
and Henry F. Holtzclaw, “Statistics,” Chap. XV, Prentice-Hall, Inc., New
York, 1929. A good elementary discussion of the concepts involved in 
measuring reliability is found in Frederick C. Mills, “Statistical Methods 
Applied to Economics and Business,” Chaps. XIV and XVIII, Henry Holt 
and Company, Inc., New York, 1938. A very able but more advanced 
discussion of this problem appears in Burton H. Camp, “The Mathematical 
Part of Elementary Statistics,” D. C. Heath and Company, Boston, 1931. 
For the best description of the new methods of treating these problems, the 
student is referred to R. A. Fisher, “Statistical Methods for Research 
Workers,” 3d ed., Oliver & Boyd, Edinburgh and London, 1930, especially 
Chaps. IV and V. One of the simplest, most lucid discussions of the general
problem of reliability will be found in C. H. Richardson, “An Introduction 
to Statistical Analysis,” Chap. 11, Harcourt, Brace and Company, Inc., 
New York, 1935. A very fine advanced treatment is in John H. Smith, 
“Tests of Significance: What They Mean and How to Use Them,” Uni¬ 
versity of Chicago Press, Chicago, 1939. 

EXERCISES 

1. List a few cases in which it would be possible for the statistician to 
study all the cases in the universe, so that he would not have to estimate the 
characteristics of the universe from a sample. 

2. From Exercise 2, page 153, compute the standard errors of the means 
and of the standard deviations, and likewise of the two coefficients of 
variation. 

3. Compute the standard error of the coefficient of variation in the heights
of Smith College girls from Exercise 3, page 153.

4. Compute the standard error of the mean, of the standard deviation,
and of the coefficient of variation of the mothers' ages given in Exercise 4, 
page 154. 

5. The average number of offspring in 55 completed families is given in
Exercise 5, page 154, as 3.55. The standard deviation is 1.79. Hence the
standard error of the mean is 0.244. (Check this computation.) How
many cases will it be necessary to take if we are to reduce the standard error
of the mean to 0.1? Assume that the standard deviation in the number of
offspring remains the same as we increase the number of families studied.

6. Exercise 6, page 154, gives figures to show that the 22,498 divorces in
Wisconsin from 1887 to 1906 were preceded by an average married period
of 10.37 years, with a standard deviation of 8.39 years. The average
and standard deviation of 2651 cases in 1929 were 9.83 and 8.26 years,
respectively. Had there been a significant decrease in the length of mar¬ 
riages preceding divorces? Was the decrease in variability as shown by the 
smaller standard deviation significant, or might it arise from chance? If 
the latter, how likely is it that such a difference would arise by chance alone? 

7. Suppose that a group of people are tested with respect to strength of 
grip in their right hands. The average turns out to be 40 kg., with a standard
deviation of 1 kg. There are 40 people in the group. Is it reasonable
to assume that this group of 40 people is drawn from the same universe as 
the people for whom figures are given in Exercise 7, page 154? (There were 
12 men in this latter group.) 

8. In Exercise 3, page 124, we computed the mean and median hourly
wage and the quartiles. In Exercise 1, page 153, we computed the standard 
deviation of these figures. From these figures already computed find the 
standard error of the mean, of the standard deviation, of the median, of the 
quartiles, and of the semi-interquartile range. 

9. In Exercise 2, page 230, we computed the value of $\alpha_3$ for a given distribution.
Compute the standard error of $\alpha_3$. Was there significant skewness
in the distribution? (Did $\alpha_3$ differ from zero by an amount equal to
over three times the standard error of $\alpha_3$?)

10. In Exercise 13, page 186, we found that 53 of the 625 diphtheria cases
studied turned out fatally. The fatality rate is thus 8.5 per cent. What 
is the standard error of this percentage? If in a new epidemic there were 
200 cases and 8 of them resulted in fatalities, would the difference in fatality 
rates be significant? If so, what would you conclude? Suppose there
were 25 fatalities in this new epidemic. What would you then conclude? 
Give your reasoning. 

11. Is there a significant difference between the heights of the Smith
College students mentioned in Exercise 3, page 153, and the heights of the
Harvard students mentioned in the illustrative example in the text (on
pages 141ff., for example)? Is either group significantly more variable in
height than the other group? 

12. In Exercise 8 above you have found several standard errors. Com¬ 
pute the probable errors of the same values. Write the values followed by 
their probable errors as they would commonly appear.

13. Hatai measured the lengths of the craniums of 53 male rats and says
that the average length was 43.3 ± 0.17 mm.¹ Explain this combination
of figures.

14. In Providence, Rhode Island, in 1915 there were 43 cases of diphtheria
in children between the ages of one and two years. Of these, 13.95 per cent
resulted in deaths (6 deaths out of 43 cases). In the same city in the same
year there were 62 cases of diphtheria in people who were 20 years old or
over. Of these, 3.23 per cent resulted in deaths (2 deaths out of 62 cases).
Was there a significant difference between these percentages?²

15. In the “World Almanac” can be found the monthly mean temperatures
in New York City over a considerable period of years. Compute
for January and for July the average and the standard deviation of tem¬ 
peratures. Is there a significant difference between January and July 
temperatures? Is there a significant difference between the variability of 
temperatures in January and July? Is there a significant difference between 
the coefficients of variation for the two months? 

16. On page 254 are given certain figures for the weights of male and 
female rats. Is there a significant difference in variability of weights 
between the sexes? 

17. Are female rats significantly different from Hollins College girls in 
variability of weight? Compute the standard error of the difference 
between the two coefficients of variation, getting the basic figures from 
pages 252 and 254. 

18. A study of milk consumption in metropolitan Boston in December,
1930, showed that the average per-capita consumption of milk was 0.391
qt. ± 0.00262 qt.³ Explain the meaning of these figures when taken in combination.
What was the standard error of per-capita milk consumption?

19. In the period 1925-1927 the average operator’s income on 105 Con¬ 
necticut tobacco farms growing Havana seed tobacco was $905. The 
average operator’s income on 97 Connecticut tobacco farms growing broadleaf
tobacco was −$450 (that is, a loss of $450). The standard deviations
of the operators’ incomes were $1409 on the farms raising Havana seed
tobacco and $2305 on the farms raising broadleaf tobacco.⁴ Was there a
significant difference between the operators’ incomes on these two groups 
of farms? 

20. Try drawing at least 25 different samples of 10 numbers from the 
array in Sec. 4.4, page 67, so selected that in each case the average will be
100, or, what amounts to the same thing, so that the total will be 1000.

1 Quoted in Donaldson, op. cit ., p. 50. 

2 Based on figures in Whipple, “Vital Statistics,” John Wiley & Sons,
Inc., New York, 1923, p. 377. 

3 Based on figures in F. V. Waugh, Consumption of Milk and Dairy 
Products in Metropolitan Boston in December, 1930, New England Council
on Marketing and Food Supply, September, 1931, pp. 4 and 11. 

4 Based on data on C. I. Hendrickson, An Economic Study of the Agricul¬ 
ture of the Connecticut Valleys, Storrs Agricultural Experiment Station 
Bulletin 165, pp. 123 and 142. 



CHAPTER X 


THE ANALYSIS OF VARIANCE 

10.1. Combining Frequency Distributions. —We noted in Sec. 
5.16 that, if we have data for several separate frequency distri¬ 
butions and combine them into a single distribution, we can find 
the arithmetic mean of the combined distribution directly from 
the means of the constituent distributions. Similarly we noted 
in Sec. 6.10 that we can find either the standard deviation or the 
variance of the combined distribution from the standard devia¬ 
tions or variances of the constituent distributions. Let us take 
a simple example. Table 10.1 shows the distributions of birth 
weights of 102 Guernsey calves and of 113 Ayrshire calves. 1 


Table 10.1. —Birth Weights of 215 Calves 


Birth Weight          Number of
  (pounds)       Guernseys   Ayrshires

   40-49             6           0
   50-59            22           8
   60-69            45          31
   70-79            23          53
   80-89             6          13
   90-99             0           8

  Totals           102         113


Using methods long since familiar to us, we compute the arith¬ 
metic means and standard deviations of the two distributions as 
follows: 



               Arithmetic    Standard    Number of
                  Mean       Deviation     Cases

Guernseys        65.10         9.55         102
Ayrshires        73.40         9.65         113


1 Data from Fitch et al., Journal of Dairy Science, Vol. 7, p. 230. 

If we now wish to know the arithmetic mean, the standard deviation,
and the number of cases which we would get if we were to
put all 215 calves together in a single distribution, we proceed in 
accordance with the rules of Secs. 5.16 and 6.10. Obviously the 
number of cases in the combined distribution will be the sum of 
the numbers of cases in the two separate distributions: 


$$N = n_1 + n_2$$

In our case we get 102 + 113 = 215. The arithmetic average
of the 215 weights will be the weighted arithmetic average of
the averages of the two constituent distributions. This gives us

$$65.10 \times 102 = 6{,}640.20$$

$$73.40 \times 113 = 8{,}294.20$$

$$\bar{X} = \frac{6{,}640.20 + 8{,}294.20}{215} = \frac{14{,}934.40}{215} = 69.46$$

The variance of the combined distribution can be found from the 
variances of the two constituent distributions. Since the vari¬ 
ance is the square of the standard deviation, we see that the vari¬ 
ance for the Guernsey calves was 91.20 and for the Ayrshires, 
93.12. The average weight for the Guernseys was 4.36 lb. below 
the average for the combined distribution, and the Ayrshire average
was 3.94 lb. above. Thus the values of $d_1$ and $d_2$ are −4.36
and +3.94. Substituting these values in our formula of Sec.
6.10 we get

$$v = \frac{102(91.20) + 113(93.12) + 102(-4.36)^2 + 113(3.94)^2}{215} = \frac{23{,}517.74}{215} = 109.38$$


The variance of the combined distribution is 109.38, and the 
standard deviation of the combined distribution is the square root 
of 109.38, or 10.46 lb. We note that the standard deviation of 
the combined distribution is greater than the standard deviation 
of either constituent distribution, while the arithmetic mean of 
the combined distribution lies between the two constituent arithmetic
means.

The formula for the variance of the combined distribution can
be rewritten in the form

$$v = \frac{n_1 v_1 + n_2 v_2}{N} + \frac{n_1 d_1^2 + n_2 d_2^2}{N}$$

The first term on the right-hand side of this equation is the 
weighted arithmetic mean of the two constituent variances. But 
it is evident that the combined variance is greater than the 
weighted arithmetic mean of the two constituent variances, since 
we add to it the right-hand term, which is the weighted arith¬ 
metic mean of the squared differences between the means of the 
constituent distributions and the mean of the combined distri¬ 
bution. In other words, the variance of the combined distribu¬ 
tion can be thought of as being made up of two parts, the first 
of which depends on the variances of the constituent distribu¬ 
tions, and the other of which depends on the variation of the 
averages of the constituent distributions from the average of the 
combined distribution. In other words, the combined variance 
depends on 

1. The variance within the subgroups. 

2. The variance between the subgroups. 

In our present example we can say that the weighted average 
variance within the groups was 

$$\frac{n_1 v_1 + n_2 v_2}{N} = \frac{102(91.20) + 113(93.12)}{215} = \frac{19{,}824.96}{215} = 92.21$$

The variance between the groups was

$$\frac{n_1 d_1^2 + n_2 d_2^2}{N} = \frac{102(-4.36)^2 + 113(3.94)^2}{215} = 17.17$$

The total variance of 109.38 for the combined distribution is the 
sum of the variance within groups, 92.21, and the variance 
between groups, 17.17. The fact that we can think of the vari¬ 
ance of the combined group as being divisible into constituent 
variances is important, and should be kept in mind. 

Before we go on to push this point further, however, it should 
be pointed out that we can easily verify the figures we have been 
computing for this combined distribution. If we go back to
Table 10.1 on page 264 we see that we can easily cast it into a new
combined frequency table showing the distribution of weights of 
all 215 calves. We merely add, within each class, the num¬ 
ber of Guernsey and the number of Ayrshire calves to get a 
combined total. This gives us Table 10.2, covering the com¬ 
bined frequencies. 

Table 10.2.—Weights of Calves, Based on Table 10.1

Birth Weight     Number of
  (pounds)        Calves

   40-49             6
   50-59            30
   60-69            76
   70-79            76
   80-89            19
   90-99             8

   Total           215


If we compute the arithmetic mean, the variance, and the stand¬ 
ard deviation from this table by our usual methods we find the 
following: 

Arithmetic mean        69.46

Standard deviation     10.46

Variance              109.5

These correspond exactly with the values found by the formulas 
except for a disparity in the fourth significant figure in the case 
of the variance, which arises from rounding off. In other words, 
we see that the formulas give us the same results which we would 
get if we worked with the actual frequency distributions. By 
means of the formulas we can find the arithmetic mean, the 
standard deviation, and the variance of combined distributions 
from the arithmetic means, standard deviations, and variances of 
constituent distributions even if we do not have the full data for 
these constituent distributions themselves. Even if we had 
not been given Table 10.1 we could have found the statistics of 
our combined distribution if we were told that the two arithmetic 
means were 65.10 and 73.40, that the two standard deviations 
were 9.55 and 9.65, and that the numbers of cases were 102 and 
113.

Let us gather together the formulas which we use in combining
statistics of constituent distributions to get the corresponding
statistics of a combined distribution. They are

$$(1)\quad N = n_1 + n_2 + n_3 + \cdots + n_k$$

$$(2)\quad N\bar{X} = n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3 + \cdots + n_k\bar{X}_k$$

$$(3)\quad Nv = (n_1 v_1 + n_2 v_2 + n_3 v_3 + \cdots + n_k v_k) + (n_1 d_1^2 + n_2 d_2^2 + n_3 d_3^2 + \cdots + n_k d_k^2)$$

where $N$ is the number of cases in the combined distribution and
$n_1, n_2, n_3, \ldots, n_k$ are the numbers of cases in the constituent
distributions; $\bar{X}$ is the arithmetic mean of the combined distribution
and $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots, \bar{X}_k$ are the arithmetic means of the
constituent distributions; $v$ is the variance of the combined distribution
and $v_1, v_2, v_3, \ldots, v_k$ are the variances of the constituent
distributions; $d_1, d_2, d_3, \ldots, d_k$ are the differences between the
means of the constituent distributions and the mean of the combined
distribution; and the constituent distributions are numbered
1, 2, 3, . . . , to k.
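Formulas (1) to (3) translate directly into a short routine. The following is a minimal sketch applied to the calf figures; the function name `combine` and the tuple layout are choices made for the illustration, not anything prescribed in the text. Because it carries unrounded values of the mean and of the differences, its results may differ from the hand computation by about one unit in the last decimal place.

```python
def combine(groups):
    """Combine constituent distributions given as (n, mean, variance) tuples.

    Returns N, the combined mean, the within-group part of the variance,
    the between-group part, and their sum (the combined variance).
    """
    N = sum(n for n, _, _ in groups)                                  # formula (1)
    mean = sum(n * m for n, m, _ in groups) / N                       # formula (2)
    within = sum(n * v for n, _, v in groups) / N                     # first term of (3)
    between = sum(n * (m - mean) ** 2 for n, m, _ in groups) / N      # second term of (3)
    return N, mean, within, between, within + between

# Guernseys and Ayrshires: number of cases, mean, variance (standard deviation squared)
calves = [(102, 65.10, 9.55 ** 2), (113, 73.40, 9.65 ** 2)]
N, mean, within, between, variance = combine(calves)
print(N, round(mean, 2), round(within, 2), round(between, 2), round(variance, 2))
```

Only the three summary figures for each group are needed; the frequency tables themselves never enter the computation.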

10.2. Contributions to Combined Variance.—Let us note again
the formula

$$v = \frac{n_1 v_1 + n_2 v_2}{N} + \frac{n_1 d_1^2 + n_2 d_2^2}{N}$$

As has been mentioned, this formula indicates that the variance 
of the combined group can be broken into two parts. The first
part is

$$\frac{n_1 v_1 + n_2 v_2}{N}$$

If there were no variation whatever within the individual groups,
then the standard deviations and variances of these groups would
be equal to zero, and both $n_1 v_1$ and $n_2 v_2$ would be equal to zero.
In such a case this entire first term would be equal to zero, and 
would make no contribution to the combined variance. This 
term indicates the contribution made to the combined variance 
by the variation or dispersion within the groups. The second 
part of the equation is 

$$\frac{n_1 d_1^2 + n_2 d_2^2}{N}$$

If the means of the constituent groups were all identical, then
both $d_1$ and $d_2$ would be equal to zero, and this term in its entirety
would equal zero, and would make no contribution to the com¬ 
bined variance. This second term indicates the contribution 
made to the combined variance by the variation or dispersion 
among the different groups. Our over-all variance is, then, 
broken down into two parts, one reflecting the variance within 
the separate groups and the other reflecting the variance between 
the groups. 

10.3. Purpose of Analysis of Variance.—If we wish to know 
whether there is more difference in the average weight of newborn 
Guernsey and Ayrshire calves than would arise by chance from 
samples drawn out of the same universe—if, in other words, we 
want to know whether or not the difference in the arithmetic 
means of the two constituent groups is significant—we apply the 
tests for the significance of a difference which we learned in the 
preceding chapter. But let us enlarge our problem slightly. 
Table 10.3 shows the birth weights of 94 Jersey calves and 73 
Holstein calves in addition to the 102 Guernsey and 113 Ayrshire 
calves listed in Table 10.1. The data are from the same source 
as those of Table 10.1. 


Table 10.3.—Birth Weights of 382 Calves of Various Breeds

Birth Weight                    Number of Calves
  (pounds)     Jerseys   Holsteins   Guernseys   Ayrshires   Total

   40-49          12         0           6           0         18
   50-59          42         0          22           8         72
   60-69          31         0          45          31        107
   70-79           9         8          23          53         93
   80-89           0        15           6          13         34
   90-99           0        16           0           8         24
  100-109          0        23           0           0         23
  110-119          0        11           0           0         11

  Totals          94        73         102         113        382


Suppose, now, that I am interested in learning whether or not 
there is more difference among the breeds than would be expected 
on the basis of pure chance. Do the average birth weights of the 
four breeds differ among themselves more than would be expected 
if I took four groups all of the same breed? Suppose, for example,
that I study 382 newborn calves all of which are Guernseys.
The groups contain 94, 73, 102, and 113 Guernsey calves. 
Almost certainly the average birth weights of these groups will 
differ. Is there any reason to suppose that the weights in Table 
10.3 differ any more than would have been true if there had been 
no differences in breed involved? 

Now this is a problem very similar to the problem of the signifi¬ 
cance of a difference between means. If we have but two groups 
we can test the significance of the difference in their means, and 
judge whether or not the difference is greater than would be 
expected in the difference of means of two samples selected at 
random from the same universe. With two values we can take a 
difference and test it. But how can we take a difference between 
more than two things? In Table 10.3 we have four things— 
Jerseys, Holsteins, Guernseys, and Ayrshires. We could, to be
sure, test the significance of the difference between any two 
means derived from this table. Or we could take all the possible 
pairs of means from the four means in the table and compare 
them. There are six such comparisons, and we could test the 
significance of each in turn. But we would still not have 
answered our general question, Is there a significant variation 
among the means in general? 

When we compare two values the variation between them is 
measured by their difference. But when we compare more than 
two values we have learned that the variation among them is 
measured by the standard deviation or by the variance. While 
we cannot talk sensibly about the difference between four things, 
we can sensibly talk about the variance among them. We can 
compute the four means from Table 10.3, and we can then com¬ 
pute the variance among these means. We can then ask our¬ 
selves whether the variance among the means is greater than we 
would expect to get among four means drawn from the same uni¬ 
verse. We can ask ourselves whether the variance among the 
means is significantly greater than the variance within the sepa¬ 
rate groups. We can ask ourselves whether the variance among 
the means of the different breeds is significantly greater than the 
variance within the individual breeds. 

But this goes back to the kind of problem which we faced in the 
preceding section, where we broke down the variance of a combined
group into two parts, one of which reflected the variance
within the constituent groups and the other of which reflected
the variance between the constituent groups. We might well
compute such variances for the data of Table 10.3, and find what
part of the total variance among the 382 weights listed there is
contributed by the variance within the individual breeds, and
what part is contributed by the variance between the breeds.
This is the purpose of the method which is called analysis of vari¬ 
ance. This method, originated by R. A. Fisher in 1923 and 
developed since by George W. Snedecor and many others, is one 
of the most powerful and versatile additions to statistical meth¬ 
odology in the past generation. While its early application has 
been primarily in the fields of agriculture and biological science, 
it is rapidly invading other fields, and is quite generally applicable.
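The breakdown proposed for Table 10.3 can be sketched as follows. Two assumptions are made for the illustration: every calf in a class is placed at the class midpoint (45, 55, and so on, which reproduces the Guernsey and Ayrshire means quoted earlier to within rounding), and variances use the divisor N as in Sec. 10.1. The variable names are invented here. The sketch also lists the six pairwise comparisons mentioned above and checks that the within and between parts add up to the variance of all 382 weights taken together.

```python
from itertools import combinations

MIDPOINTS = [45, 55, 65, 75, 85, 95, 105, 115]  # assumed class midpoints, pounds
TABLE_10_3 = {
    "Jerseys":   [12, 42, 31, 9, 0, 0, 0, 0],
    "Holsteins": [0, 0, 0, 8, 15, 16, 23, 11],
    "Guernseys": [6, 22, 45, 23, 6, 0, 0, 0],
    "Ayrshires": [0, 8, 31, 53, 13, 8, 0, 0],
}

def mean_and_variance(freqs):
    """Number of cases, mean, and divisor-N variance of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, MIDPOINTS)) / n
    var = sum(f * (x - mean) ** 2 for f, x in zip(freqs, MIDPOINTS)) / n
    return n, mean, var

stats = {breed: mean_and_variance(f) for breed, f in TABLE_10_3.items()}
N = sum(n for n, _, _ in stats.values())
grand_mean = sum(n * m for n, m, _ in stats.values()) / N
within = sum(n * v for n, _, v in stats.values()) / N
between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats.values()) / N

# Variance of the Total column, computed directly as a check.
totals = [sum(column) for column in zip(*TABLE_10_3.values())]
_, _, total_variance = mean_and_variance(totals)

pairs = list(combinations(TABLE_10_3, 2))  # the six two-breed comparisons

for breed, (n, m, v) in stats.items():
    print(f"{breed:<10} n = {n:>3}  mean = {m:6.2f}  variance = {v:7.2f}")
print(f"within = {within:.2f}, between = {between:.2f}, total = {total_variance:.2f}")
print(f"{len(pairs)} pairwise comparisons")
```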

10.4. The Concept of Degrees of Freedom.—Suppose that you 
are told to take any five numbers whose arithmetic average is 10. 
You are obviously completely free to choose any five numbers 
which you wish. Or at least, you appear to be completely free. 
But if you start out more or less at random, and select first the 
numbers 3, 2, 9, and 12 you find that your freedom has disap¬ 
peared. Having selected these four numbers, there is only one 
number which you can now possibly select if the average is to be 
10. If five numbers are to have an average of 10, their sum must 
be 50; and since the four numbers already selected have a sum of 
26, the fifth number must be 24. You were told to select “any” 
five numbers, but when you were given the added limitation or 
restriction that the average must be 10 you lost one degree of 
freedom. Instead of being free to choose any five numbers you
were really free to choose any four, and the fifth was then fixed 
by the terms of the problem. 

It may appear that you were not really completely free to 
choose even the first four numbers. At first sight the student is 
apt to think that even with these numbers he must select them in 
such a way that their sum is 50 or less. But a moment’s reflec¬ 
tion will show that this is not the case. If the first four numbers 
selected are 15, 17, 10, and 16, with a sum of 58, we now merely 
take —8 as our fifth and final figure and meet the requirement 
that the mean be 10. If we include the possibility of using nega¬ 
tive numbers, we are really completely free to select any four 
numbers, but we have no choice at all in selecting the fifth. We 
have four “degrees of freedom” in our problem. The degrees of
freedom in any problem are the number of values which we are at
liberty to select at will, but do not include those values which are 
fixed in size by the character of the problem itself. When we are 
computing an arithmetic mean, the number of degrees of freedom 
is always one less than the number of cases studied. When, in 
an earlier chapter, we computed the arithmetic average of the 
heights of 1000 Harvard students, we had 999 degrees of 
freedom. 
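The loss of one degree of freedom can be seen in a two-line computation: once all but one of the numbers are chosen, the required mean dictates the last. The helper name `last_value` below is hypothetical and serves only this illustration.

```python
def last_value(chosen, required_mean):
    """Value forced on the final number once the others have been chosen freely."""
    n = len(chosen) + 1                      # total count, including the forced value
    return n * required_mean - sum(chosen)   # required total minus what is already used

print(last_value([3, 2, 9, 12], 10))     # 24
print(last_value([15, 17, 10, 16], 10))  # -8
```

Whatever four numbers are supplied, exactly one fifth value satisfies the restriction, which is why only four of the five choices are free.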

The mathematically minded student can prove, and other 
students can discover by experiment, that when we compute a 
standard deviation or a variance we lose two degrees of freedom. 
In such cases the number of degrees of freedom is two less than 
the number of cases. Let us take again a simple example. You 
are told, let us say, to select “any” five numbers in such a way 
that their arithmetic mean is 10 and their standard deviation is 
also 10. Note that this time our problem has an added restric¬ 
tion. We start, as before, to select numbers, and, as before, our 
first three numbers are 3, 2, and 9. There are limitless pairs of 
numbers which we can select for our fourth and fifth numbers and 
still have a total of 50 and an average of 10. But if we are to have 
an average of 10 and a standard deviation of 10, we have already 
used up all our degrees of freedom, as we can quickly demon¬ 
strate. Having selected our first three numbers at will, the 
other two numbers are both determined by our requirement that 
the arithmetic mean and the standard deviation equal 10. 

Our first three numbers have a sum of 14. The other two 
numbers must have a sum of 36. Let us call them m and 36 − m.
The formula for the standard deviation is

$$\sigma = \sqrt{\frac{\sum x^2}{N}}$$

Since in our problem the standard deviation is required to be 10
we may write

$$\sqrt{\frac{\sum x^2}{5}} = 10$$

Consequently,

$$\frac{\sum x^2}{5} = 100 \quad \text{and} \quad \sum x^2 = 500$$

Let us now write down in tabular form our five values and compute
their standard deviation by accustomed methods, as in
Table 10.4.

Table 10.4.—Computation of Standard Deviation

Original Values
      (X)               x                   x²

       3               −7                   49
       2               −8                   64
       9               −1                    1
       m             m − 10          m² − 20m + 100
    36 − m        36 − m − 10        m² − 52m + 676

Sums  50                0            2m² − 72m + 890


We discover from the last column that $\sum x^2 = 2m^2 - 72m + 890$,
but since we already know that $\sum x^2 = 500$ we can write

$$2m^2 - 72m + 890 = 500$$

Solving this quadratic equation for m, we discover that m = 6.64
and 36 − m = 29.36. These two values, then, are the remaining
two values, and the only remaining two values which can be
chosen if the mean of the five values is to equal 10 and their standard
deviation is also to equal 10. Our five values are 3, 2, 9, 6.64,
and 29.36.
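The same solution can be reached numerically. The sketch below is one way of doing it and is not the method of Table 10.4: instead of expanding the quadratic in m, it writes the two unknown values as the half of their required sum minus and plus a quantity h, and solves for h. The function name `remaining_pair` is invented for the illustration, and the standard deviation uses the divisor N as in the text.

```python
import math

def remaining_pair(chosen, n, mean, sd):
    """Two values completing `chosen` to n numbers with the given mean and
    standard deviation (divisor N)."""
    s = n * mean - sum(chosen)                               # required sum of the pair
    q = n * sd ** 2 - sum((x - mean) ** 2 for x in chosen)   # squared deviations left for the pair
    # Let the pair be s/2 - h and s/2 + h. Their squared deviations from the
    # mean add to 2*(s/2 - mean)**2 + 2*h**2, which must equal q.
    h_squared = q / 2 - (s / 2 - mean) ** 2
    if h_squared < 0:
        raise ValueError("no real pair satisfies both restrictions")
    h = math.sqrt(h_squared)
    return s / 2 - h, s / 2 + h

low, high = remaining_pair([3, 2, 9], n=5, mean=10, sd=10)
print(round(low, 2), round(high, 2))
```

The error branch reflects a point the worked example does not meet: for some choices of the first three numbers no real pair can satisfy both restrictions.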

Each time we put a new restriction on our problem, we lose a 
degree of freedom. When we were fitting frequency curves in 
Sec. 8.12, we ran across the same situation. And in Chap. IX, 
when we were computing the standard error of the arithmetic 
mean, we lost one degree of freedom by forcing the universe and 
sample to agree in one particular, namely, in having the same 
standard deviation. Consequently in that case we divided by 
N — 1, the number of degrees of freedom, instead of by N, the 
number of cases. 

In a footnote on page 138, Sec. 6.6, we noted that one can com¬ 
pute the standard deviation of a set of values by means of the 
formula

$$\sigma = \sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2}$$

Obviously the variance, which is the square of the standard deviation,
is

$$v = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2 = \frac{\sum X^2}{N} - \frac{(\sum X)^2}{N^2}$$

Since we know from our earlier formulas that the variance may
also be defined as $\sum x^2 / N$, we can write

$$\frac{\sum x^2}{N} = \frac{\sum X^2}{N} - \frac{(\sum X)^2}{N^2}$$

Multiplying through by N we get

$$\sum x^2 = \sum X^2 - \frac{(\sum X)^2}{N}$$

This value, $\sum x^2$, we call the “sum of the squares,” and we shall
use it often in our work on analysis of variance. But now when
we want the variance (sometimes alternatively called the “mean
square”) instead of dividing the sum of the squares by N, we shall
divide it by the number of degrees of freedom. If we let d.f.
represent the number of degrees of freedom we can define the
variance or mean square as

$$\text{Variance} = \text{mean square} = \frac{\text{sum of squares}}{\text{d.f.}} = \frac{\sum X^2 - \dfrac{(\sum X)^2}{N}}{\text{d.f.}}$$

Of course, if we are working with frequency distributions we can 
substitute the familiar “short-method” formulas of Chap. V, 
namely, 


$$\text{Variance} = \text{mean square} = \frac{\sum (fd^2) - \dfrac{(\sum fd)^2}{N}}{\text{d.f.}}$$


To take a simple example, if we are given the numbers 3, 2, 9, 
6.64, and 29.36, and wish to compute their variance, we proceed 
as follows: 

            X            X²

            3            9
            2            4
            9           81
            6.64        44.0896
           29.36       862.0096

Sums       50         1000.0992

The sum of the squares, or $\sum x^2$, is

$$1000.0992 - \frac{(50)^2}{5} = 500.0992$$

The variance, or mean square, is 500.0992 divided by the number
of degrees of freedom. If, in our original problem, we were free
to write down any five numbers whatever without restriction,
then d.f. = 5 and the variance is

500.0992/5 = 100.0198 

If, on the other hand, we had the situation which existed when 
we first ran across these five values, there were but three degrees 
of freedom, and the variance is 500.0992/3 or 166.6997. We 
shall find it necessary from here on, when we are computing vari¬ 
ances, to note carefully the number of degrees of freedom in our 
problem. It is evident that this will be especially important 
when we are dealing with small samples, since in such cases the 
difference between N on the one hand, and N — 1 or N — 2 or 
N — 3 on the other hand, is relatively large. 
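The effect of the divisor is easy to check by machine. This is a small sketch with invented function names; it computes the sum of the squares from the raw values and then divides by whatever number of degrees of freedom is supplied.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean: sum(X^2) - (sum X)^2 / N."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n

def mean_square(values, df):
    """Variance (mean square): the sum of squares divided by the degrees of freedom."""
    return sum_of_squares(values) / df

values = [3, 2, 9, 6.64, 29.36]
ss = sum_of_squares(values)
print(round(ss, 4))                      # sum of the squares
print(round(mean_square(values, 5), 4))  # five degrees of freedom
print(round(mean_square(values, 3), 4))  # three degrees of freedom
```

With only five cases, moving from five to three degrees of freedom raises the variance by two-thirds, which is the small-sample effect the paragraph above warns about.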

10.5. Computation of Analysis of Variance.—We are now ready
to proceed with a simple example of analysis of variance. The 
method is applicable to a great many very diverse sorts of prob¬ 
lems, and in an elementary treatment of this sort it is impossible 
to indicate all the types of cases in which the method is useful. 
We shall illustrate, however, with a typical kind of problem, 
which will make it possible to describe the major features of 
computation, and, more important, the method of interpreting 
the results. 

Table 10.5 gives a series of frequency distributions 1 showing 
the I.Q.’s (intelligence quotients) of 8109 children selected at 
random from youngsters just starting the work of the seventh 
grade in public schools of five United States cities in the fall of 
1948. These figures appear in the upper part of the table, above 
the horizontal dividing lines. There is a distribution for each 
month, and these monthly distributions are gathered together 
into a total frequency distribution at the extreme right. Each 
figure in the right-hand column is the sum of the 12 figures imme¬ 
diately at its left. 

Even casual inspection of the table will show that there are

1 Hans C. Gordon and Benjamin J. Novak, I.Q. and Month of Birth,
Science, Vol. 112, No. 2898, July 14, 1950, pp. 62-63.

differences in the monthly distributions. We see, for example,
that the figures in the January column are concentrated at a 
somewhat lower point than are the August figures. This becomes 
even more evident if we compute the monthly average I.Q.’s by 
the methods which we learned in Sec. 5.3. These averages are 
as follows: 


Jan.     94.67        May      97.28        Sept.    97.96
Feb.     94.61        June     97.01        Oct.     96.80
Mar.     94.12        July     97.71        Nov.     97.92
Apr.     95.55        Aug.     98.54        Dec.     94.64

Average for the year     96.59



The averages vary in size. In fact, it would be surprising if they 
did not. Even if we took 12 samples from exactly the same uni¬ 
verse we would expect some variation by chance alone, just as 12 
hands of cards dealt successively from the same deck do not all 
contain the same number of spades or the same number of black 
cards. We are not so much interested in the fact that the aver¬ 
ages vary, as in the question whether they do or do not vary more 
than would be expected merely by chance. 

We might, of course, compare the largest and the smallest 
average, and test the significance of the difference between them 
by the methods of the preceding chapter. The largest average 
is 98.54 in August, and the smallest is 94.12 in March. The dif¬ 
ference is 4.42. But if we were to test the significance of this 
difference, it would be important for us to realize that we should 
not interpret it exactly as we have learned to interpret other such 
measures, for this time we did not choose a difference at random. 
We purposely chose the largest difference there was. There are 
many differences which could be taken between pairs of 12 num¬ 
bers. In fact, there are 66 different possible pairs of numbers 
between which differences could be taken, and we have purposely 
chosen the largest one. If we were to find that the difference is 
so great that it would happen only 1 time in 20 by pure chance 
(that is, if it exceeds the 5 per cent fiducial limit) we should not 
be surprised, because we know already that it was the largest 
difference of 66. The student must realize when he studies large 
numbers of cases that he should expect to have a few things hap¬ 
pen which do not often happen by chance. If we study a million 
cases we should have one thing happen which would occur by 
chance only one time in a million. Therefore our first inclination
to test the significance of the greatest of the differences is not so
straightforward an approach as might appear. 
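The count of 66 pairs, and the fact that the March and August comparison is the extreme one, can be checked mechanically. The sketch below assumes the twelve monthly averages as listed above (with decimal points restored where the listing is unclear); the variable names are illustrative only.

```python
from itertools import combinations

# Monthly mean I.Q.'s as listed in the text. The March and August values
# are confirmed by the prose; the others are read from the listing.
monthly_means = {
    "Jan": 94.67, "Feb": 94.61, "Mar": 94.12, "Apr": 95.55,
    "May": 97.28, "June": 97.01, "July": 97.71, "Aug": 98.54,
    "Sept": 97.96, "Oct": 96.80, "Nov": 97.92, "Dec": 94.64,
}

# Every unordered pair of months between which a difference could be taken.
pairs = list(combinations(monthly_means, 2))
n_pairs = len(pairs)


def gap(pair):
    """Absolute difference between the two monthly means of a pair."""
    first, second = pair
    return abs(monthly_means[first] - monthly_means[second])


largest_pair = max(pairs, key=gap)
largest_gap = gap(largest_pair)

print(n_pairs, largest_pair, round(largest_gap, 2))
```

Because the pair tested was picked as the largest of all these differences, its significance cannot be read from the ordinary tables as if it had been chosen at random.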

What we evidently need to do in this case, as was suggested in 
Sec. 10.3, is to discover whether the variance among the 12 aver¬ 
ages is greater than would be likely to occur on the basis of chance 
in samples drawn from original data which vary as much as these 
data do vary. The children born in January varied among 
themselves, as did the children in every other month. With all
this variation within the months, how much variation can we
reasonably expect between the monthly averages before we start
to say, “This did not happen by chance. There is really a fundamental
and basic difference between the I.Q.’s of children born
in different months”? We now proceed to find a quantitative
answer to this question. Our steps are recorded in the lower 
half of Table 10.5, below the horizontal dividing lines. 

In line 1 we merely record the totals of the columns. Thus the 
figure 689 in line 1 of the May column is the sum of the 10 figures 
above. It indicates that there were altogether 689 children in 
the study who were born in May. 

The figures in lines 2 and 3 are derived from tables for finding 
the arithmetic mean and the standard deviation by the short 
method, as described in Table 6.5 on page 141. We work out 
such a table for each column of Table 10.5, and record the values 
of $\Sigma(fd)$ and $\Sigma(fd^2)$ in lines 2 and 3. To make sure that this is
understood, we give the computations for the January column 
and for the total column in Table 10.6. It will be noted that the 
appropriate totals from these tables appear in lines 1, 2, and 3 of 
Table 10.5. It is advisable in these computations to use the 
same guessed mean for each column of Table 10.5, since we are 
then provided with some simple checks on the accuracy of our 
arithmetical computations. Other similar computations are 
made for the other months of Table 10.5, and the totals are 
carried to lines 1, 2, and 3 of Table 10.5. And as a check on our 
arithmetic, the sum of the 12 items for January to December in 
any of these lines should equal the figure in the total column as we 
computed it in Table 10.6. We can find these values (8109; 1287; 
and 22,449) either by adding the 12 figures at the left in Table 
10.5, or by computing them as the totals in Table 10.6. 

Lines 1, 2, and 3 at the bottom of Table 10.5 give, respectively, 
then, the values of $N$, $\Sigma(fd)$, and $\Sigma(fd^2)$ for the various columns.
To get line 4, we square the figure in line 2 and divide the square
by the figure in line 1. For example, for January the figure in
line 2 is -21. We square this to get 441. We then divide 441
by the January figure of line 1, which is 632. The quotient,
441/632, is 0.7, which we enter under January in line 4. The
other figures in line 4 are found similarly, and evidently they can
be symbolized as $(\Sigma fd)^2/N$.

Finally we find the values of line 5 in Table 10.5 by subtracting 
each item in line 4 from the item above it in line 3. Thus for 


Table 10.6.—Intermediate Computations for Analysis of Variance
of Data in Table 10.5

                    January Figures                     Figures for Totals
Class
Mark        $f$    $d$    $fd$    $fd^2$        $f$     $d$     $fd$     $fd^2$

145           2      5      10        50         17       5        85        425
135           9      4      36       144        145       4       580      2,320
125          36      3     108       324        504       3     1,512      4,536
115          74      2     148       296      1,035       2     2,070      4,140
105         107      1     107       107      1,656       1     1,656      1,656
 95         145      0       0         0      1,915       0         0          0
 85         138     -1    -138       138      1,552      -1    -1,552      1,552
 75          81     -2    -162       324        896      -2    -1,792      3,584
 65          30     -3     -90       270        284      -3      -852      2,556
 55          10     -4     -40       160        105      -4      -420      1,680

Totals      632            -21     1,813      8,109             1,287     22,449


January we find that 1813 - 0.7 = 1812.3, and similarly for the
other months. These procedures for lines 4 and 5 are followed
in all columns except the final total column. The entries in lines
4 and 5 in the final totals column (the numbers 385.4 and 22,063.6)
are the sums of the 12 numbers immediately at the left—the sums 
for the months from January through December. We cannot 
apply the check in these cases of computing the same result two 
ways, as we did in lines 1, 2, and 3. 
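The column computations just described can be sketched in Python. The frequencies are those of the January and totals columns of Table 10.6, with each $fd$ recomputed as $f \times d$; the function names `lines_1_to_3` and `lines_4_and_5` are illustrative labels, not terms used in the text.

```python
# Deviations d from the guessed mean, in class-interval units, for the ten
# classes of Table 10.6, highest class first.
deviations = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]

january_f = [2, 9, 36, 74, 107, 145, 138, 81, 30, 10]
totals_f = [17, 145, 504, 1035, 1656, 1915, 1552, 896, 284, 105]


def lines_1_to_3(freqs, devs):
    """Return N, the sum of fd, and the sum of fd^2 for one column."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, devs))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, devs))
    return n, sum_fd, sum_fd2


def lines_4_and_5(n, sum_fd, sum_fd2):
    """Return (sum of fd)^2 / N and the column's sum of squares."""
    line4 = sum_fd ** 2 / n
    return line4, sum_fd2 - line4


jan = lines_1_to_3(january_f, deviations)
jan_line4, jan_line5 = lines_4_and_5(*jan)
tot = lines_1_to_3(totals_f, deviations)

print(jan, round(jan_line4, 1), round(jan_line5, 1))
print(tot)
```

Only lines 1, 2, and 3 are computed for the totals column here, since lines 4 and 5 of that column are obtained by adding the twelve monthly entries rather than by this formula.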

The figures in line 5 are evidently values of $\Sigma(fd^2) - (\Sigma fd)^2/N$.
But this is what we learned in Sec. 10.4 to call the “sum of 
squares.” Our computations at the bottom of Table 10.5 are 
really directed toward finding systematically the sums of squares 
for the months. Having found them for the individual columns,
we proceed to find them also for the table as a whole, thus:

1. Sum of squares for the whole table:

$$22{,}449 - \frac{(1287)^2}{8109} = 22{,}449 - 204.3 = 22{,}244.7$$

2. Sum of squares between columns:

$$385.4 - \frac{(1287)^2}{8109} = 385.4 - 204.3 = 181.1$$

3. Sum of squares within columns:

The sum of line 5 in Table 10.5 = 22,063.6

The items used in finding these sums of squares are all taken from 
the final column of Table 10.5. 

It should be noted that the sum of squares for the entire table 
is also the sum of the other two sums of squares. That is, 

22,244.7 = 181.1 + 22,063.6 

This relationship will always hold true, and provides us a further 
convenient check on the accuracy of our arithmetical computa¬ 
tions. It is clear that we have succeeded in breaking the sum of 
squares of our problem into two parts, one of which (181.1) 
reflects the variation between columns, and the other of which 
(22,063.6) reflects the variation within the columns. Whereas
we noted early in this chapter that one could combine variances
of separate constituent groups to get the variance of the combined
group, we have been reversing the process, and breaking
down the sum of squares for an entire problem into its constituent
parts. 
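The three sums of squares and the additive check can be retraced as follows. This is a sketch using only the totals quoted in the text (8109; 1287; 22,449; and the line 4 total of 385.4); the variable names are illustrative.

```python
# Totals-column entries of Table 10.5 as quoted in the text.
n_total = 8109          # line 1: number of cases
sum_fd_total = 1287     # line 2: sum of fd
sum_fd2_total = 22449   # line 3: sum of fd^2
line4_total = 385.4     # line 4: sum of the twelve monthly (sum of fd)^2 / N

# The term subtracted in both of the first two sums of squares.
correction = sum_fd_total ** 2 / n_total

ss_whole = sum_fd2_total - correction     # sum of squares for the whole table
ss_between = line4_total - correction     # sum of squares between columns
ss_within = sum_fd2_total - line4_total   # sum of line 5, within columns

print(round(correction, 1), round(ss_whole, 1),
      round(ss_between, 1), round(ss_within, 1))
```

The check holds by construction: adding the between and within figures cancels the line 4 total and leaves the whole-table figure.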

The three sums of squares are now converted to variances or 
mean squares by dividing them by the appropriate numbers of 
degrees of freedom, as explained in Sec. 10.4. For the entire 
table we have one fewer degree of freedom than the total number 
of cases studied. Since our problem covered 8109 cases, we have 
8108 degrees of freedom. Between columns we have one fewer 
degree of freedom than the number of columns. Our problem 
covered 12 columns (12 months), and here we have 11 degrees of
freedom. Within the columns we lost one degree of freedom for 
each column, or 12 degrees of freedom over all. Therefore 
within columns our number of degrees of freedom is 

8109 - 12 = 8097

If we let N represent the total number of cases studied (8109
in this problem) and m represent the number of categories compared
(such as the number of months of birth, or the number
of breeds of cattle, etc.), then our numbers of degrees of freedom
are:

For the whole table     $N - 1$
Between columns         $m - 1$
Within columns          $N - m$
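As a quick check on these three counts, the degrees of freedom between columns and within columns should add up to the degrees of freedom for the whole table, just as the corresponding sums of squares do. The identity below is simple algebra and is offered only as an aid to checking, using the figures of the present problem:

```latex
(m - 1) + (N - m) = N - 1, \qquad 11 + 8097 = 8108
```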

Using these values we compute our variance thus: 



                      Sum of        Degrees of           Variance or
                      Squares       Freedom              Mean Square

Whole table           22,244.7      N - 1 = 8108             2.744
Between columns          181.1      m - 1 = 11              16.5
Within columns        22,063.6      N - m = 8097             2.72


The variances which we find by dividing the sums of squares by 
the corresponding number of degrees of freedom are the end 
products of analysis of variance. We ask ourselves now, “Is it 
unusual to get as much variance as 16.5 between the I.Q.’s of the
various months when there is a variance of only 2.72 within the 
months themselves? Is such a great variance between months 
likely to occur by pure chance when samples are drawn from the 
same universe?” To answer this question we merely compare 
the two variances, dividing the larger mean square by the smaller. 
Thus in our problem of intelligence quotients we compute 


$$F = \frac{16.5}{2.72} = 6.07$$


This quotient is called F, in honor of R. A. Fisher, who originated 
the methods which we have been applying. We see that the 
variance between the means of columns is 6.07 times as great as 
the variance within the columns themselves. If the columns 
were substantially alike, as they would be if they were drawn at 
random from the same universe, we would expect these two mean 
squares to be equal, and F to have a value of 1. If F does have a
value of 1 or less (that is, if the variance between columns is equal 
to or smaller than the variance within columns), we realize that 
there is no significant variation between the columns—that the 
columns differ no more than would be expected if they had been
drawn from the same universe. For example, if the value of F
in our problem of intelligence quotients had turned out to be 1 or
less, instead of 6.07, we would have concluded that, while there
were differences in the monthly distributions of I.Q.’s, these differences
were no greater than one would expect by chance, and there
was no reason to suppose that there was any real fundamental 
difference in the I.Q.’s of children born in different months. 
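The step from sums of squares to mean squares and then to F can be sketched as below, using the sums of squares and degrees of freedom given in the text. One point worth noting is an observation of this sketch rather than of the text: the figure 6.07 comes from dividing the rounded mean squares, 16.5 and 2.72, while the unrounded quotient is slightly smaller. The variable names are illustrative.

```python
# Sums of squares and counts as given in the text.
ss_between = 181.1
ss_within = 22063.6
n_cases = 8109      # N, total number of cases
m_columns = 12      # m, number of months compared

df_between = m_columns - 1
df_within = n_cases - m_columns

ms_between = ss_between / df_between   # mean square between columns
ms_within = ss_within / df_within      # mean square within columns

# F from the unrounded mean squares.
f_ratio = ms_between / ms_within

# F from mean squares rounded as in the text's table (16.5 and 2.72).
f_from_rounded = round(ms_between, 1) / round(ms_within, 2)

print(df_between, df_within)
print(round(ms_between, 2), round(ms_within, 3))
print(round(f_ratio, 2), round(f_from_rounded, 2))
```

Either way the ratio is about 6, so the interpretation that follows is not affected by the rounding.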

When, as in our case, the value of F exceeds 1, we still need to 
decide whether it is enough greater than 1 to warrant the con¬ 
clusion that it did not happen by chance. How large an F value 
can occur by chance alone? This depends on the numbers of 
degrees of freedom involved. With very small samples we can 
get very great F values purely by chance, as the student will 
realize. We could toss a well-balanced penny once and get a 
“head.” We could toss it again and get a “tail.” How many 
times as many heads did we get the first time as the second? One 
is how many times as great as zero? We see that when the sample
is small we may get infinitely large ratios of results without any 
real significance at all. Yet if our samples are large, even rather 
small ratios may be important. If we toss two pennies, each 
10,000 times, and one of them turns up “heads” 52 per cent of 
the time while the other turns up “heads” 48 per cent of the 
time, we would be well justified in thinking that there was a real 
difference in the pennies, and that the difference in their perform¬ 
ances was not a chance affair even though the ratio of the results 
was as 52/48 or only 1.08. Our intuition leads us correctly to 
discount the infinite proportional difference between single throws 
of pennies, but to give serious weight to even small proportionate 
differences based on large samples. 
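The penny illustration can be made numerical. The sketch below assumes the usual normal-approximation standard error of a difference between two proportions, with the two samples pooled; that choice of test is an assumption of this example and is not specified in the passage above.

```python
import math

tosses = 10_000                 # tosses of each penny
p_first, p_second = 0.52, 0.48  # proportions of heads observed

ratio = p_first / p_second      # the "only 1.08" of the text

# Pooled proportion (the two samples are of equal size).
pooled = (p_first + p_second) / 2

# Standard error of the difference between the two proportions.
se_diff = math.sqrt(pooled * (1 - pooled) * (1 / tosses + 1 / tosses))

# Difference expressed in standard-error units.
z = (p_first - p_second) / se_diff

print(round(ratio, 2), round(se_diff, 5), round(z, 2))
```

A difference of more than five standard errors would very rarely arise by chance, even though the ratio of the two results is barely above 1.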

What we need, then, is some simple way of deciding whether a 
given F value is large or small in the light of the number of 
degrees of freedom on which it was based. The chances of 
occurrence of such F values have been computed, and are tabu¬ 
lated in Appendix III starting on page 516. This table lists the 
5 per cent points and the 1 per cent points of F for various numbers
of degrees of freedom. We always enter this table by letting
$n_1$ across the top represent the number of degrees of freedom of
the larger of our two variances or mean squares. In our present
case, then, we select the column
headed 11, since our larger variance was based on 11 degrees of
freedom. In the left-hand column we look for an $n_2$ value equal
to the number of degrees of freedom of our smaller mean square.
In our problem the smaller mean square was based on 8097 d.f.,
and consequently we look for the value 8097 in the left-hand
column of Appendix III. We do not find this value, since naturally
the table cannot list every possible value. When the number
of degrees of freedom is as large as 8097, we shall not be far
wrong if we decide to use the $n_2$ value of infinity, $\infty$, at the
extreme bottom of the table. We then find that when there are 
11 degrees of freedom for the greater variance and infinite number 
of degrees of freedom for the smaller variance, the 5 per cent 
limit for F is 1.79 and the 1 per cent limit is 2.24. These values 
tell us that random variation would give an F value greater than 
1.79 only 5 per cent of the time, and that F values exceeding 2.24 
would occur by chance but 1 per cent of the time. Our F value 
of 6.07 is far greater than the 1 per cent limit, and evidently would 
occur far less often than 1 per cent of the time by chance. If we 
make such statements as that the variation between months did 
not happen by chance, we shall be wrong less than 1 per cent of 
the time. As a matter of fact, more extensive tables would show 
that there is less than one chance in a thousand of such an F value 
arising by chance alone. 

When our F value is larger than the 5 per cent point but smaller 
than the 1 per cent point, we know that the probability of such 
an occurrence arising by chance is less than 0.05 but greater than 
0.01. If the F value of our problem is greater than the 1 per cent
point, we know that the probability of such an occurrence arising
by chance is less than 0.01. But if the F value of our problem is
smaller than the 5 per cent point, we know that such an occurrence
would arise by chance oftener than five times in 100. It is
customary when the F value is smaller than the 5 per cent point 
to say that the variation is not significant; if the F value lies 
between the 5 per cent and the 1 per cent points, to say that the 
variation is significant; and if the F value is greater than the 1 
per cent point, to say that the variation is highly significant. 
Thus in our problem of the monthly distribution of I.Q.’s we
would conclude that there is highly significant variation among 
the monthly distributions. We would conclude that there are 
far greater differences between the months than would be likely 
to occur by chance if we had drawn 12 distributions from a homo¬ 
geneous universe. 
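The customary three-way reading of an F value against the two tabled points can be written as a small function. This is a sketch only: the function name `interpret_f` is hypothetical, the two points must still be looked up in the table, and the treatment of an F value falling exactly on a tabled point is an assumption, since the text does not address that case.

```python
def interpret_f(f_value, five_pct_point, one_pct_point):
    """Classify an F value against the tabled 5 and 1 per cent points."""
    if f_value < five_pct_point:
        return "not significant"
    if f_value <= one_pct_point:
        # Assumption: a value exactly at either point is called "significant".
        return "significant"
    return "highly significant"


# The I.Q. problem: F = 6.07, with tabled points 1.79 and 2.24.
verdict = interpret_f(6.07, 1.79, 2.24)
print(verdict)
```

The function records only the statistical verdict; what the variation means remains a question for the subject-matter specialist.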

The student should understand that this is as far as we can go
as statisticians in interpreting the results. The statistician can¬ 
not say whether the differences in I.Q. among children born in 
different months were caused by differences in temperature during 
the prenatal period, or whether intelligence quotients really meas¬ 
ure intelligence or not, or whether the problem was ever an impor¬ 
tant one to study. Decisions on these and hundreds of other 
legitimate questions must be answered by the physician or the 
physiologist or the psychologist or the educator or by someone 
with specialized knowledge in the field from which the subject 
matter was drawn. Knowledge of statistical method alone is 
never sufficient to warrant the drawing of final and dogmatic 
conclusions in a field in which the statistician is not expert. The 
skills of the statistician must be combined with the skills of other 
fields if any of the skills is to work to complete advantage. 

10.6. Summary of Steps in Analysis of Variance. —The method 
of analysis of variance is applicable to a wide range of problems 
which differ in many respects among themselves. The prac¬ 
ticing statistician soon learns that no two problems are quite the 
same, and that they never match exactly the illustrative examples 
of the textbooks. The elementary treatment in an introductory 
text can do no more than hint at the possibilities and help the 
student to become familiar with the interpretation of results. 
Perhaps the commonest cases, however, are those in which the 
statistician is comparing several frequency distributions, as we 
did in Table 10.5. In such cases we follow the procedures which 
were illustrated at length in the preceding section, and which 
can be summarized as follows: 

1. Set up a table similar to Table 10.5, listing in parallel columns the 
frequency distributions which are to be compared. There are m such
distributions, listed in m parallel columns.

2. Add at the right a column of totals, obtained by adding across the 
table the frequencies within each class. This will make m + 1 columns in 
the table (not counting, of course, the column which lists the class limits). 

3. For each column compute the values of $\Sigma f$, $\Sigma fd$, and $\Sigma fd^2$ as in Table
10.6. Enter these results in lines 1, 2, and 3 at the foot of the table. 

a. Use the same guessed mean for all columns. This makes it possible 
to check the accuracy of arithmetical computations in the totals column of 
lines 1, 2, and 3. 

4. Square the entries in line 2, and divide these squares by the entries in 
line 1. This will give for each column a value of $(\Sigma fd)^2/N$, which should
be listed in line 4.

5. Add the values in line 4 to get the final line 4 value in the column of
totals. 

6. Subtract each entry in line 4 from the entry above it in line 3. This 
will give for each column a value of $\Sigma(fd^2) - (\Sigma fd)^2/N$. These are the
sums of squares of the various columns, and should be entered in line 5. 

7. Add the values in line 5 to get the final line 5 value in the column of 
totals. This will complete the entire table similar to Table 10.5. 

8. Find the additional required sums of squares as follows: 

a. For the whole table: Square the item in line 2 of the totals column, and
divide by the item in line 1 of the totals column. Subtract the result from 
the item in line 3 of the totals column. 

b. Between columns: Square the item in line 2 of the totals column, and 
divide by the item in line 1 of the totals column. Subtract the result from 
the item in line 4 of the totals column. 

c. Within columns: The sum of the squares within columns requires no 
further computation. It is the item in line 5 of the totals column. 

9. At this point we have another check on the accuracy of our arith¬ 
metical computations. The sum of squares for the whole table should be 
also the sum of the other two sums of squares. In other words, the item 
found in Step 8a should be the sum of the items found in Steps 8b and 8c.

10. We now find the mean squares, or variances, by dividing each of the 
three sums of squares found in Step 8 by the appropriate number of degrees 
of freedom, as follows: 

a. For the whole table: Divide by N — 1, which is one less than the value 
of the item in line 1 of the totals column. 

b. Between columns: Divide by m — 1, where m is the number of fre¬ 
quency distributions being compared, as defined in Step 1. 

c. Within columns: Divide by N — m. That is, subtract the value of m 
used in Step 1 from the item in line 1 of the totals column, and use the 
difference as the divisor. 

11. The values found in Step 10 are the mean squares, or variances. 
Divide the mean square between columns (Step 10b) by the mean square
within columns (Step 10c). The quotient is the value of F.

12. Find the 5 per cent and the 1 per cent points of F from the table in 
Appendix III. Use the column in that table headed with the number of 
degrees of freedom of Step 10b. Use the row at the right of the number of
degrees of freedom of Step 10c. Where this column and row intersect will 
be found the values of the required 5 per cent and 1 per cent points. (If 
the value of F turns out to be smaller than 1, we know without referring to 
the table that there was no significant variation between the columns.) 

13. It is customary to interpret the results as follows: 

a. If F is smaller than the 5 per cent point, the variation between columns 
is not significant, and might easily have arisen by chance. 

b. If the value of F lies between the 5 per cent and the 1 per cent points, 
the variation between the columns is significant, and probably did not arise 
by chance. 

c. If the value of F is greater than the 1 per cent point, the variation 
between the columns is highly significant, and almost certainly did not arise 
by chance. 
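The steps above, up to the computation of F, can be gathered into one routine. The sketch below is one possible arrangement and not a definitive implementation: the function name `analysis_of_variance` and the small data set are hypothetical, every column is assumed to contain at least one case, and the tabled 5 and 1 per cent points of Steps 12 and 13 must still be looked up separately.

```python
def analysis_of_variance(columns, deviations):
    """Analysis of variance for m frequency distributions.

    columns    -- list of m lists of class frequencies (Step 1)
    deviations -- deviation d of each class from the common guessed mean
    """
    m = len(columns)

    # Steps 3, 4, and 6: lines 1 to 5 for each column.
    line1 = [sum(col) for col in columns]
    line2 = [sum(f * d for f, d in zip(col, deviations)) for col in columns]
    line3 = [sum(f * d * d for f, d in zip(col, deviations)) for col in columns]
    line4 = [fd ** 2 / n for fd, n in zip(line2, line1)]
    line5 = [fd2 - l4 for fd2, l4 in zip(line3, line4)]

    # Steps 2, 5, and 7: the totals column.
    n_total, fd_total, fd2_total = sum(line1), sum(line2), sum(line3)
    line4_total, line5_total = sum(line4), sum(line5)

    # Step 8: the three sums of squares.
    correction = fd_total ** 2 / n_total
    ss_whole = fd2_total - correction
    ss_between = line4_total - correction
    ss_within = line5_total

    # Step 9: arithmetic check that the parts add to the whole.
    assert abs(ss_whole - (ss_between + ss_within)) < 1e-9 * max(1.0, abs(ss_whole))

    # Steps 10 and 11: mean squares and F.
    ms_between = ss_between / (m - 1)
    ms_within = ss_within / (n_total - m)
    return {
        "ss_whole": ss_whole, "ss_between": ss_between, "ss_within": ss_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical data: three distributions over three classes (not from the text).
example_deviations = [1, 0, -1]
example_columns = [[1, 2, 1], [3, 1, 0], [0, 1, 3]]
result = analysis_of_variance(example_columns, example_deviations)
print(result)
```

Since the deviations are measured in class-interval units, the sums of squares and mean squares are in those units as well; F, being a ratio of two mean squares, does not depend on the unit chosen.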

10.7. Suggestions for Further Reading.—The student who wants to go
right to headquarters in studying analysis of variance will certainly read 
R. A. Fisher, “Statistical Methods for Research Workers,” which continues 
to appear in edition after edition through Oliver & Boyd, Edinburgh and 
London. Unfortunately the beginning student finds Fisher’s treatment 
difficult to follow, and he will probably prefer George W. Snedecor, “Calcu¬ 
lation and Interpretation of Analysis of Variance and Covariance,” Col¬ 
legiate Press, Inc., of Iowa State College, Ames, Iowa, 1934. “Statistical 
Methods Applied to Experiments in Agriculture and Biology,” put out by 
the same author and publishers in 1946, is also excellent, and gives a much 
more complete treatment than do most general statistics textbooks. Palmer 
O. Johnson, “Statistical Methods in Research,” Prentice-Hall, Inc., New
York, 1949, has a good but mathematical treatment. Alexander M. Mood, 
“Introduction to the Theory of Statistics,” McGraw-Hill Book Company, 
Inc., New York, 1950, has an excellent treatment for the student who is 
reasonably at home with the calculus. A much more elementary presentation
appears in Frank A. Pearson and Kenneth R. Bennett, “Statistical
Methods Applied to Agricultural Economics,” John Wiley & Sons, Inc.,
New York, 1942. Most of these books, and particularly those by Professor 
Snedecor, give a much greater number of far more diverse illustrations of 
the application of the methods than are included in this chapter. 

EXERCISES 

1. Compute the analysis of variance for the data of Table 10.1. Compute 
also the significance of the difference between the mean birth weights of 
Guernsey and Ayrshire calves from the same table, using the methods of 
the preceding chapter. Compare the results. 

2. Compute the analysis of variance for the data of Table 10.3. Interpret
the results. 

3. You are told to select any five numbers such that the arithmetic mean 
will be 10 and the standard deviation will be 2. You start out at random, 
selecting the numbers 8, 9, and 11. Demonstrate that you have exhausted
all the degrees of freedom. What are the only remaining numbers which 
can be selected to meet the requirements of the problem? 

4. How many degrees of freedom do I have if I am instructed to: 

a. Select several numbers?

b. Select 10 numbers?

c. Select 10 even numbers?

d. Select 10 consecutive integers?

e. Select 10 numbers whose sum is less than 8000?

f. Select 10 numbers whose total is 28?

g. Select 10 numbers whose total is 28 and whose arithmetic mean is 2.8?

h. Select 10 numbers whose total is 28 and whose arithmetic mean is 4.6?

i. Select 10 numbers whose total is 61 and whose standard deviation is 4?

j. Select four different even integers with values greater than 1 and 
smaller than 9. 

k. Select two numbers whose sum is 14 and whose difference is 4.

5. On page 273 appears the statement that “Each time we put a new
restriction on our problem we lose a degree of freedom.” Yet in the preceding
problem part c adds a restriction not present in part b without altering
the number of degrees of freedom. In fact, there are several other cases in 
Exercise 4 where restrictions are added which do not alter the number of 
degrees of freedom. Can you restate the sentence quoted above from 
page 273 to make it more nearly accurate? 

6. The statement is made in Sec. 10.5 that when a student studies large 
numbers of cases he should expect to find some things actually occurring 
that would be extremely unlikely on the basis of chance. Suppose we 
shuffle a pack of cards and deal an ordinary bridge hand of 13 cards. The 
particular combination of cards in this hand actually did occur. How 
likely is it that such a combination should occur by pure chance? Note 
that things actually do happen many times every day by pure chance which 
are extremely unlikely when considered as isolated hypothetical cases. 

7. Work out a table similar to Table 10.6 for the February data of Table
10.5, and check the results entered in lines 1, 2, and 3 at the foot of Table
10.5. Be sure to select the class mark of the 90-99 class as the guessed mean.

8. A study was made to determine whether or not there was significant
variation in sizes of families between large cities, small cities, and rural areas. 
Analysis of variance gave the following results: 



                     Sum of       Degrees of
                     Squares      Freedom

Total                2.563        23
Between areas        1.876         2
Within areas         0.687        21


Complete the analysis, and interpret your results. 

9. In attempting to see if there is a significant difference between the 
prices paid for potatoes by families with different incomes, Pearson and 
Bennett 1 divided 354 families into four income groups and computed the 
analysis of variance. They found the following intermediate results: 



                     Sum of       Degrees of
                     Squares      Freedom

Total                67           353
Between incomes      13             3
Within incomes       54           350


Complete the analysis and interpret the results. 

1 Frank A. Pearson and Kenneth R. Bennett, “Statistical Methods Applied
to Agricultural Economics,” John Wiley & Sons, Inc., New York, 1942,
p. 358.

10. A university instructor in physics gets students in his elementary
course from the colleges of liberal arts, agriculture, home economics, business
administration, pharmacy, premedical, predental, and engineering: eight
separate curricula altogether. He has 201 students in the elementary 
course, and he wonders if the students from the various curricula vary 
significantly in their grades. He applies analysis of variance, and finds a 
total sum of squares of 50,304 and a sum of squares between curricula of 
2598. Complete the analysis of variance and interpret the results. 



CHAPTER XI 


FITTING STRAIGHT LINES 

Among the most important problems faced by the scientist is 
that of measuring and describing relationships between things. 
Most scientific laws are statements of such relationships. For 
example, the law of gravitation tells us that the attraction 
between bodies depends on their masses and on the distance 
between them, and describes the nature of that relationship. 
The law of falling bodies tells us that the distance which a body 
falls depends on the time it is falling and on the attraction of 
gravity, and describes quantitatively the interrelationship. One 
of Kepler’s laws tells us that there is a relationship between the 
distance of a planet from the sun and the length of time which it 
takes for the planet to complete a revolution around the sun, 
and again gives a quantitative expression to the relationship. 
The scientist trying to determine when an ancient civilization 
flourished discovers that there is a relationship between the 
amount of radioactive carbon in the wood of ancient buildings 
and furniture and the length of time since the trees were cut to 
get the original lumber. And similarly over and over again 
scientists in all fields are testing and measuring in an effort to 
find relationships. 

11.1. The Use of Two Variables.—As soon as we begin to
measure relationships we must, of course, treat two or more variables
at once. In one problem the variables may be distance
from the sun on the one hand, and length of period of rotation on 
the other. Or there may be three variables studied together, as 
when we find the relationship between (1) the attraction of 
gravity, (2) the distance between the bodies, and (3) the masses 
of the bodies. Or an agricultural expert may be determining the 
effect of (1) amount of fertilizer applied, (2) amount of rainfall, 
and (3) average July temperature on (4) the yield of corn per acre. 
So the number of variables may increase to half a dozen or so, 
although for reasons we shall soon discover it is uncommon for 
the scientist to consider more than four or five at most in a given 
problem.

When we have two or more variables in a problem, it becomes
necessary to set up some system for distinguishing them. If we 
are considering a single variable and talk about the mean or the 
standard deviation, everyone knows what mean or standard 
deviation is in question. But if we are studying the relationship 
between heights and weights of men, and talk about the standard 
deviation, no one can tell whether we mean the standard devia¬ 
tion of the heights or the standard deviation of the weights unless 
we specify. 

Having so far talked about but one variable at a time, we have
referred to whatever variable was being studied as X. In our
formulas $\Sigma X$ has meant “the sum of the values of whatever thing
we were studying.” But if we are studying both heights and
weights, such symbolism would be ambiguous. Hence the statistician
adopts one or the other of two conventional modes of
expression. Suppose his problem is one of studying the change
through time of wheat acreage and wheat prices. There are
three variables: (1) wheat acreage, (2) wheat price, (3) time. He
may distinguish them by assigning different letters to the different
variables, thus:

Let X represent time. 

Let Y represent wheat acreage. 

Let Z represent wheat price. 

Or he may distinguish them by using numerical subscripts, 
thus: 

Let $X_1$ represent wheat price.

Let $X_2$ represent wheat acreage.

Let $X_3$ represent time.

If he follows the first of these plans, he will refer to the averages, 
standard deviations, medians, quartiles, etc., as follows: 

$\bar{X}$ = average of the X’s; that is, average time
$\bar{Y}$ = average of the acreages
$\bar{Z}$ = average price
$y$ = deviation of any acreage from the average acreage
$\sigma_z$ = standard deviation of the prices
$\text{Med.}_y$ = median acreage
$Q_{1z}$ = first quartile of the prices
$\Sigma x^2$ = sum of the squares of the deviations of the times
from the average time,
etc.

If he follows the second plan, he will distinguish the variables by
using numerical subscripts:

$\bar{X}_1$ = average price
$\bar{X}_2$ = average acreage
$x_1$ = deviation of a price from the average price
$\sigma_2$ = standard deviation of the acreages
$\text{Med.}_3$ = median time (the median year, the median week,
or the median minute, depending on the periods
into which time is divided in the problem)
$\Sigma x_2^2$ = sum of the squares of the deviations of the acreages
from the average acreage,
etc.

Often when we use different letters to represent the variables 
it is helpful to use the initial letters of the variables rather than 
X, Y, and Z. For example, we might use T, A, and P to represent
time, acreage, and price. Such usage has an obvious
mnemonic advantage. 
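
The two plans of notation can be mimicked in a few lines of Python. This is only a sketch: the figures for time, acreage, and price are invented for illustration and are not taken from the text, and the names `by_letter`, `by_subscript`, and `describe` are our own. The standard deviation is computed with the number of cases as divisor, which is an assumption about the convention intended.

```python
from statistics import mean, pstdev

# Hypothetical observations for five years; the numbers are invented for illustration.
time = [1, 2, 3, 4, 5]                      # years
acreage = [60.0, 62.0, 59.0, 65.0, 64.0]    # millions of acres
price = [1.10, 1.05, 1.20, 0.95, 1.00]      # dollars per bushel

# First plan: a different letter for each variable.
by_letter = {"X": time, "Y": acreage, "Z": price}

# Second plan: one letter with numerical subscripts.
by_subscript = {1: price, 2: acreage, 3: time}

def describe(values):
    """Return the mean, the standard deviation, and the sum of squared deviations."""
    m = mean(values)
    deviations = [v - m for v in values]
    return m, pstdev(values), sum(d * d for d in deviations)

for label, values in by_letter.items():
    m, sd, ss = describe(values)
    print(f"mean of {label} = {m:.3f}, sigma_{label} = {sd:.3f}, "
          f"sum of squared deviations = {ss:.3f}")
```

Whichever plan is chosen, each statistic carries the label of its own variable, so that no one need ask which standard deviation is meant.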

11.2. The Nature of Relationship.—Any study of the nature of 
relationship or causation raises philosophical problems which are 
far too abstruse for discussion here. It seems to be true that no 
one knows very clearly what is meant by the statement that one 
thing “caused” another thing. Yet for the purposes of everyday
life there is enough meaning to the statement so that it helps men 
in their thinking, and gives them a convenient method of dodging 
the philosophical problems involved by means of an elliptical 
expression. 

When we say that two things are “related,” we may mean that
the connection between them is very definite and unchangeable, 
or we may intend merely to call attention to some sort of loose 
connection between the two. For example, we say that the 
circumference and the radius of a circle are related. Here the 
relationship is definite and unalterable, and can be expressed by 
means of the mathematical equation 

$c = 2\pi r$

For any given radius there is one and only one circumference, and 
this relationship of r to c remains the same century after century 
without end. Suppose, however, we say that the price of potatoes
and the quantity produced are related, or that a child’s age
and height are related. The connection in this case is by no
means so sharply defined. It is not true that for each age
there is one height and only one. Children of the same age vary
in height. Similarly potato prices are not always equal for crops
of equal size.

Let us take another case which is somewhat more complicated. 
If we study the period of the pendulum, we discover that there is 
a relationship between the length of the pendulum and its period. 
If we let t represent the time of the vibration, l the length of the 
pendulum, and g the attraction of gravity, the relationship can 
be expressed by the formula 

$t = 2\pi \sqrt{l/g}$

Now it will be noticed that there is no single period of oscillation
which will occur with every pendulum of a given length.
We cannot say that for each and every length of pendulum there 
is one and only one period of oscillation. As long as there are 
variations in the attraction of gravity we may get changes in the 
period of oscillation of a pendulum without changes in length. 
When we say, then, that there is a relationship between the 
length of the pendulum and its period, we do not mean that the 
relationship is a simple one. We do not mean that one can tell 
the exact period from a knowledge of the length. And when we 
say that there is a relationship between a child’s age and his 
height, we likewise do not mean that a knowledge of the age will 
make it possible for us to tell the exact height of the individual. 
There are, of course, other factors which are related to height (just 
as there was the additional factor of gravity also involved in the 
swinging of the pendulum), and it is possible that, if we knew 
them all and knew the facts with regard to their interrelationships,
we could tell the exact height of the child just as we can tell
the exact period of the pendulum if enough data are given. Some 
men assume that all facts are so related that if we knew enough 
about them we could explain them all by methods as satisfactory 
as those used to explain the swinging of the pendulum. 
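
The point about the pendulum can be checked numerically. The sketch below assumes the usual small-swing formula for the full period, t = 2π√(l/g); the two values of g are approximate figures for the equator and the poles, supplied here only for illustration, and the function name is our own.

```python
import math

def pendulum_period(length_m, g):
    """Period in seconds of a simple pendulum (small swings): t = 2*pi*sqrt(l/g)."""
    return 2 * math.pi * math.sqrt(length_m / g)

length = 1.0  # metres, held fixed throughout

# Approximate values of g in metres per second squared, used only for illustration.
for place, g in [("equator", 9.780), ("poles", 9.832)]:
    print(f"{place}: period = {pendulum_period(length, g):.4f} s")
```

The length is the same in both lines of output, yet the periods differ, which is all that is meant by saying that length alone does not fix the period.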

In most statistical problems there are many variables, and the
exact relationships which exist between them are unknown. We
have no formulas from which we can give a complete mathematical
statement of the problem. As we saw in Chap. I, the statistician
commonly deals with problems in which there are many
things varying at once, and where it is impossible on account of
the nature of the data to hold forces constant. Hence relationships
cannot ordinarily be stated so simply or so satisfactorily as
can the relationship between the radius and the circumference
of a circle.

What, then, do we mean when we say that there is a “relationship”
between the height and the age of children? We may mean
any one of a number of things. We may mean, for example, that 
the average height increases (or decreases) with age, so that if we 
divide children into groups according to age we shall find changes 
in the average height accompanying changes in the age. We may 
mean that the dispersion of heights differs with age, so that the 
heights are more widely scattered at some ages than at others. 
We may mean that there are differences in skewness or kurtosis 
of the height distributions at different ages. If the frequency 
distributions of heights vary (more than they would vary as the 
result of chance) from one age to another, we should say that the 
ages and the heights are related. 

This principle can be stated to advantage in a somewhat different
way. To be sure, the knowledge of a child’s age does not
make it possible to estimate his height exactly. But does it help 
at all in estimating the height? Suppose that you have the 
problem of guessing the height of an unknown child; would it help 
you at all to be told that the youngster is two years old? This 
knowledge would not make it possible for you to tell the exact 
height, but it would make it possible for you to estimate the 
height with less error than would otherwise exist in your answer. 
In such cases, where a knowledge of the value of one variable 
helps us in estimating the value of another variable, we say that 
the two variables are “related.” This does not mean that one of 
them “causes” the other, but merely that the knowledge of the 
value of one is an aid to us in estimating the value of the other. 
Suppose that you have the problem of estimating the price which 
will be paid for potatoes at retail in New York City next fall. 
You know the following facts: 

1. Easter of the year in question falls on Apr. 4. 

2. The Philadelphia Athletics have a team batting average of 0.231 on
July 17 of the year in question. 

3. The quantity of potatoes harvested in the United States in the year 
in question is 400 million bushels. 

4. The price of rice is lower than it has been for 50 years. 

Which of these facts will you consider when making your estimates?
It is probable that you will give no weight at all to
either of the first two. That will mean that in your opinion
the price of potatoes is not related to the date on which Easter
falls or to the team batting average of the Athletics. It may well
be that you will consider the other two facts in making your
estimate. This will mean that you think that there is some relationship
between the price of potatoes on the one hand and the
size of the crop and the prices of other sources of carbohydrate
food on the other hand. If you can make a better estimate of
potato prices by considering any certain factor than you could
without considering it, then that factor is related to the price of
potatoes. This is all that is meant by “relationship” as the word
is used here, and statistical investigations can determine no more.
No statistical process can demonstrate cause and effect, but there 
are statistical processes, which we shall now consider, that show 
whether or not it is worth while to consider certain particular 
factors in making estimates of others. 
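
The idea that a related variable is one which reduces the error of our estimates can be put in a short Python sketch. The ages and heights below are invented for illustration and are not drawn from any table in the text; the average absolute error is used as the measure of error merely because it is the simplest, and other measures would serve as well.

```python
from statistics import mean

# Invented (age in years, height in inches) pairs, for illustration only.
children = [(2, 33), (2, 35), (2, 34), (6, 45), (6, 47), (6, 46),
            (10, 54), (10, 56), (10, 55)]

heights = [h for _, h in children]
overall = mean(heights)          # the best single guess when age is unknown

# Average height within each age class.
by_age = {}
for age, h in children:
    by_age.setdefault(age, []).append(h)
group_mean = {age: mean(hs) for age, hs in by_age.items()}

# Average absolute error of the two ways of guessing.
error_without_age = mean(abs(h - overall) for h in heights)
error_with_age = mean(abs(h - group_mean[age]) for age, h in children)

print(f"average error ignoring age: {error_without_age:.2f}")
print(f"average error using age:    {error_with_age:.2f}")
```

If the second error is smaller than the first, age is "related" to height in the sense used here; nothing is implied as to cause.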

11.3. Simple Methods of Finding Relationships.—In our discussion
of the nature of relationships we suggested that if children
were classified by age, and if the average height were computed
for each class, we could then see whether or not there were variations
in the average heights at the different ages and thus infer
something as to the existence or nonexistence of relationships.
This is one of the simplest, easiest, oldest, and most satisfactory 
ways of discovering relationships and of presenting evidence as to 
the nature thereof. The common tables of height and weight are 
computed on such a basis. But the method can be applied in any 
field, and is usually one of the first methods used by a statistician 
who is investigating relationships of any kind. We can illustrate 
with Table 11.1, which shows the average systolic blood pressures 
of 50,000 persons carrying life insurance, classified according to 
age. This table shows that blood pressure rises, on the average, 
as ages increase. While there were doubtless a good many exceptions,
the typical situation is for older people to have higher blood
pressure than younger people. We would conclude that age and 
blood pressure are related. When, as in this case, both variables 
tend to increase together (the high values of one being associated 
with high values of the other), we say that there is a positive or 
direct relationship. 
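
The inspection of group averages can be sketched in Python. The averages are those of Table 11.1; the function `direction` and its three labels are our own illustrative names. A steadily rising or falling sequence is only suggestive, since the sketch makes no test of whether the differences exceed what chance would produce.

```python
# Average systolic blood pressure (mm) by age class, from Table 11.1.
table_11_1 = [
    ("15-19", 120), ("20-24", 122), ("25-29", 123), ("30-34", 124),
    ("35-39", 126), ("40-44", 128), ("45-49", 130), ("50-54", 132),
    ("55-59", 134), ("60-64", 135), ("65 and over", 136),
]

def direction(averages):
    """Classify how successive group averages move: 'positive', 'negative', or 'mixed'."""
    steps = [b - a for a, b in zip(averages, averages[1:])]
    if all(s > 0 for s in steps):
        return "positive"
    if all(s < 0 for s in steps):
        return "negative"
    return "mixed"

pressures = [p for _, p in table_11_1]
print(direction(pressures))  # positive
```

A result of "mixed" does not by itself show that no relationship exists; the averages may rise and then fall along a smooth curve.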

Table 11.1.—Average Systolic Blood Pressure at Various Ages

Age (years)      Average Systolic Blood Pressure (millimeters)
15-19            120
20-24            122
25-29            123
30-34            124
35-39            126
40-44            128
45-49            130
50-54            132
55-59            134
60-64            135
65 and over      136

In 1916 the United States Public Health Service made a study
in seven South Carolina cotton-mill villages, and as part of the
study they investigated the relationship between the size of the
family income per person and the amount of sickness in the
family. Counting only cases of disabling sickness, and stating
the figures in rates per 1000 people, they found the sickness rates
listed in Table 11.2.¹ These figures indicate that there is a relationship
between income and sickness rates, although they do not
show whether the people were sick because of privations resulting
from low incomes, or whether their incomes were small because
they were sick and hence not working regularly. In other words,
nothing is shown as to cause and effect, but evidence is presented
that a relationship exists. In this case, as one variable grows
larger the other grows smaller. As incomes rise, sickness rates
fall. Such a situation is called a negative or inverse relationship.

Table 11.2.—Relation of Sickness to Income, South Carolina, 1916

Half-monthly Income per Adult Male      Sickness Rate per 1000 Persons
Less than $6                            70.1
$6-$7.99                                48.2
$8-$9.99                                34.4
$10 and over                            18.5

¹ Quoted by Douglas, Hitchcock, and Atkins, in their “Worker in Modern
Society,” p. 318, University of Chicago Press, Chicago, 1925.

In order to show the difference in the results, let us examine a
case in which computation of averages for the data when classified
into groups gave no evidence of relationship. In a study of
the use of automobiles on New York State farms, figures are
given showing the distance cars were driven per year in relationship
to the distance that the owner lived from a paved road.¹
They appear in Table 11.3.

Table 11.3.—Relation of Season’s Mileage of Automobile to Distance
from Hard Road

Miles to Hard Road      Season’s Mileage
0                       4385
0.1-0.9                 4152
1.0 and more            4342

Mere differences in the averages of groups give no conclusive 
evidence of the existence of relationship. As we have discovered 
in an earlier chapter (pages 264ff.), it would be advisable in such
cases to compute the analysis of variance to see if we had more 
variance between groups than would be expected on the basis of 
chance. Our results would then be far more significant. 
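
As a reminder of what that computation involves, here is a minimal sketch of the variance ratio for a one-way classification. The individual mileages are invented for illustration (Table 11.3 gives only the group averages), and the function name `f_ratio` is our own. The resulting ratio would still have to be compared with a table of F for the stated degrees of freedom, a step the sketch does not perform.

```python
from statistics import mean

def f_ratio(groups):
    """Between-group variance estimate divided by within-group variance estimate."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented season's mileages for cars at three distances from a hard road.
groups = [
    [4100, 4700, 4300, 4500],
    [3900, 4400, 4200, 4100],
    [4600, 4000, 4400, 4300],
]

df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
print(f"F = {f_ratio(groups):.2f} with {df_between} and {df_within} degrees of freedom")
```

A ratio near 1 means that the group averages differ about as much as chance alone would lead us to expect.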

One must not get the impression from the examples which have
been used that it is necessary for the group averages to increase
or decrease regularly and continuously throughout the table
before we can draw the conclusion that a relationship exists.
Examine Table 11.4, in which couples are classified according to
the length of their married life, and in which for each group the
number of divorces per 100 married population has been computed.²
This table seems to show that the divorce rate rises for
the first few years and then falls. But this fact would not deter
one from using a knowledge of the length of married life in estimating
the likelihood of divorce. The relationship merely turns
out to be curvilinear. If the averages differ by a fixed amount, so
that when graphed they would fall along a straight line or approximately
so, we say that the relationship between the variables is
linear. If the averages when plotted would fall along a smooth
curve, or approximately so, we say that the relationship is curvilinear.
And if the averages when plotted do not seem to fall
along any curve whatever, we say that there is no evidence of
relationship between the variables.

Table 11.4.—Relation of Divorce Rate to Length of Married Life

Years of Married Life      Divorces per 100 Married Population
1                          0.70
2                          1.20
3                          1.32
4                          1.32
5                          1.27
6                          1.10
7                          1.00
8                          0.97
9                          0.84
10                         0.67

¹ J. M. Bannerman, Economic Study of 941 Automobiles on New York
State Farms, Farm Economics, Cornell University, June, 1931, p. 1565.

² This is the first part of a long table in A. Cahen, “Statistical Analysis of
Divorce,” p. 120, Columbia University Press, New York, 1932. In the
original the entries continue to 30 years, and the divorce rates continue to
decrease throughout the remainder of the table.
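
One rough way to tell a linear from a curvilinear set of averages is to look at the differences between successive averages, which a straight line would keep nearly constant. The sketch below applies this to the rates of Table 11.4; it is an informal check of our own devising, not a method prescribed in the text.

```python
# Divorces per 100 married population by year of married life, from Table 11.4.
rates = [0.70, 1.20, 1.32, 1.32, 1.27, 1.10, 1.00, 0.97, 0.84, 0.67]

# Differences between successive averages; a linear relationship
# would keep them nearly constant.
differences = [round(b - a, 2) for a, b in zip(rates, rates[1:])]
print(differences)

rising = [d for d in differences if d > 0]
falling = [d for d in differences if d < 0]
print(f"{len(rising)} rises and {len(falling)} falls")
```

Here the differences begin positive and end negative, so no single straight line could describe the table; yet the averages move smoothly, which is the mark of a curvilinear relationship.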

This method of discovering relationship by a comparison of
group averages is exceedingly useful. Whenever the statistician
is beginning to study a problem and is investigating the relationships
involved, he is likely to use this method first of all. It
tells him whether or not it is worth while to proceed with more
complicated methods, and gives him a basis for selecting the type
of method to use. It has the advantage that the results are
readily understood even by the uninitiated and that the computation
is short and easy.

11.4. The Scatter Diagram.—We turn now to a second simple 
method for studying relationships, less common than the one just 
mentioned, but, nevertheless, a very helpful tool. This is the 
method of plotting the data on a scatter diagram, or scattergram, in
order that one may see the relationship. It is a graphic method, 
making its appeal to the eye. 

Our first step is to designate one of our variables as the independent
variable and the other as the dependent variable. We
decide which is which on the basis of our problem. We said in
Sec. 11.2 that problems of relationship ordinarily arise when men
are trying to estimate something. They wish, for example, to
estimate the price of potatoes, and in making their estimate
they make use of data showing the number of bushels of potatoes
harvested. The variable which we are trying to estimate (the
price of potatoes in the example just given) we call the dependent
variable, since our estimate of its value will be dependent on other
data. The variable which we are using as a basis for making our
estimate (bushels harvested in our example) we call the independent
variable, since we select values of this variable independently
and then find what values of the dependent variable
“depend on” them. Two men studying the relationship between
the size of the potato crop and the price of potatoes may make
opposite choices in the matter of dependence and independence.
One may make prices dependent while the other makes prices
independent. Yet both may be right. One of them wishes to
estimate the price which there will be with given sizes of potato
crops. He makes prices dependent. The other wishes to estimate
the sizes of potato crops, using as a basis the price of potatoes.
He makes the size of crop dependent.

Let us start with a simple case similar to that already mentioned
in Table 11.1, but including for the time being only a small
number of cases. Table 11.5 gives the ages and blood pressures
of 21 people.¹ We could try either to estimate people’s ages from
their blood pressures (as we estimate a horse’s age from his teeth)
or their blood pressures from their ages (as we would do if we were
trying to find the average or expected or “normal” blood pressure
for any age). In the first case, age would be dependent and blood 
pressure independent. In the second case, blood pressure would 
be dependent and age independent. The one arrangement is as 
good as the other, but we shall adopt the latter and make blood 
pressure dependent. In other words, we shall try to find a 
method for estimating blood pressure from age. 

¹ Data kindly furnished from actual cases by Ralph L. Gilman, M.D.

Table 11.5.—Ages and Systolic Blood Pressures of 21 People

Name           Age (years)      Systolic Blood Pressure (millimeters)
Ruby           12               108
Betsey         12               120
Muriel         12               128
Mary           13               110
Alice          17               124
Frank          18               110
Walter         24               130
Dorothy        25               130
Frederick      26               124
Esther         27               120
Sidney         28               130
Albert         33               140
Edith          35               130
John           35               140
Robert         44               120
Dan            46               126
Priscilla      55               130
Donald         57               120
James          60               114
Peter          62               140
Ralph          78               160

First we must decide whether or not age and blood pressure are
related. We can tell roughly if we plot the data of Table 11.5
on cross-section paper. In such cases we always plot the independent
variable on the horizontal and the dependent variable on
the vertical axis. This time, therefore, we shall plot ages along
the horizontal and blood pressures along the vertical axis. We
note that the ages run from 12 to 78 years, and therefore we lay off
a horizontal axis covering this range. The blood pressures vary
from 108 to 160, and our vertical axis must cover that range. We
then proceed as in Fig. 11.1, putting a point on our chart to represent
each person in Table 11.5. Each point represents a combination
of age in years and blood pressure in millimeters. For
example, the first person listed in Table 11.5 is 12 years old.
Therefore the point representing her in Fig. 11.1 must have an
abscissa (horizontal position) of 12. But this same person has a
blood pressure of 108 mm., so her point must have an ordinate
(vertical position) of 108. Her point on the diagram is now
determined by these two values, and this particular point is indicated
in Fig. 11.1 by an arrow. The other 20 points of Fig. 11.1
represent the other 20 persons listed in Table 11.5. Such a
chart, where the points represent pairs of values, is called a scatter
diagram or a scattergram.
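
The construction of the scatter diagram can be sketched in Python without any drawing apparatus. The values below are transcribed from Table 11.5; where a digit is unclear in the table, the value consistent with the text has been assumed. The crude grid that is printed counts the persons falling in each 10-year by 10-mm. cell and is only a stand-in for a proper chart.

```python
# (name, age in years, systolic pressure in mm) as read from Table 11.5.
people = [
    ("Ruby", 12, 108), ("Betsey", 12, 120), ("Muriel", 12, 128),
    ("Mary", 13, 110), ("Alice", 17, 124), ("Frank", 18, 110),
    ("Walter", 24, 130), ("Dorothy", 25, 130), ("Frederick", 26, 124),
    ("Esther", 27, 120), ("Sidney", 28, 130), ("Albert", 33, 140),
    ("Edith", 35, 130), ("John", 35, 140), ("Robert", 44, 120),
    ("Dan", 46, 126), ("Priscilla", 55, 130), ("Donald", 57, 120),
    ("James", 60, 114), ("Peter", 62, 140), ("Ralph", 78, 160),
]

# Each person becomes one point: abscissa = age, ordinate = pressure.
points = [(age, bp) for _, age, bp in people]
ages = [x for x, _ in points]
pressures = [y for _, y in points]
print("horizontal axis (independent, age):", min(ages), "to", max(ages))
print("vertical axis (dependent, pressure):", min(pressures), "to", max(pressures))

# A crude text scatter diagram: rows are 10-mm bands, columns are 10-year bands.
for top in range(160, 90, -10):
    row = ""
    for left in range(10, 80, 10):
        count = sum(1 for x, y in points
                    if left <= x < left + 10 and top <= y < top + 10)
        row += str(count) if count else "."
    print(f"{top:3d} | {row}")
```

Even in so rough a picture the counts drift from the lower left toward the upper right, though with a good deal of scatter.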

Fig. 11.1. Ages and blood pressures of 21 persons.

Now that the scatter diagram is completed we study it carefully
to see if the points fall at random over the surface or if they
form some definite pattern. In Fig. 11.1 there is not such a
clearly defined pattern as to be immediately noticeable; yet the
points are not really distributed completely at random. We note
a tendency for the points to fall in a rather wide and poorly
defined band running from the lower left to the upper right.
Our main purpose in making a scatter diagram is to look for such
patterns. In this illustrative case we have studied relatively few
cases, and what relationship there is is not particularly close, so
the pattern is not completely clear. Let us make a similar diagram
of the blood-pressure data of Table 11.1, where thousands of
cases are studied, and where only the averages are given for the
various age groups. This process of averaging will eliminate the
wide scatter of Fig. 11.1, and the inclusion of so many cases will
tend to give more regularity to the pattern. The diagram
appears as Fig. 11.2. Here we note that the points on the diagram
fall in a definite and clear band from the lower left to the
upper right. In fact, they fall almost along a straight line. To
indicate this fact a straight broken line has been drawn with a
ruler in such a way that it follows along the general path of the
dots on the diagram.

Fig. 11.2. Average systolic blood pressures at various ages, based on 50,000
cases. With freehand linear regression line.

11.5. Regression Lines and Trend Lines.—The line which has
been drawn on Fig. 11.2 indicates in general terms the nature of
the relationship between age and blood pressure. To be sure, the
points do not fall exactly on the line; but we can think of at least
two reasons for their failure to do so. In the first place, as we
learned in Chap. II, our various measurements are subject to
error. We do not know with complete accuracy the ages of the
persons whose blood pressure was measured. Young people tend
to overstate their ages, middle-aged people to understate them,
and very old people to overstate them again. Moreover almost
certainly most of the ages were given to the nearest year, or as of
the last birthday, some one way and some another. In addition
blood pressures are not obtained with complete accuracy. A
quick inspection of the data of Table 11.5 will show, for example,
that the physician who took these readings often rounded them
off to the nearest 10 mm., and moreover had a striking preference
for even numbers. Possibly his testing apparatus was calibrated
only in even millimeters. At any rate, it is almost certain that
his patients did not really all have even-numbered blood pressures.

The inaccuracy of measurement, then, may account in part for
the fact that the points of Fig. 11.2 do not all fall exactly on a
straight line. But there is another reason which is even more
likely to account for it. This is that there are almost certainly
other things in addition to age which affect blood pressure. In
Fig. 11.1 the points are widely scattered because the blood pressures
of the individuals were affected not only by age but also by
condition of health, recency of exercise, presence or absence of
mental strain, and a host of other things. The three 12-year-old
girls did not all have the same blood pressure. Their ages were
identical (as far as we can tell from Table 11.5), but their blood
pressures were 108, 120, and 128. Similarly, Edith and John
were the same age, but their blood pressures differed by 10 mm.
These differences obviously were not the result of differences in
age, but of other factors. These other factors tend to obscure
the relationship between age and blood pressure, scattering the
points on Fig. 11.1 and to a lesser degree on Fig. 11.2, although
in the latter figure these other factors are largely “averaged out”
by taking the average blood pressures of thousands of persons
who are approximately equal in age but who differ in the other
disturbing factors. After this averaging process it is easy to see
the general nature of the pattern in Fig. 11.2, and even to determine
pretty closely where the line should run.

Fig. 11.3. United States petroleum production, 1917-1929, with freehand
straight-line trend.

When we draw a scatter diagram we are usually interested in
finding some line, similar to the straight line in Fig. 11.2, which
shows the basic underlying relationship between the variables.
We assume that, had there not been inaccuracies of measurement
and other disturbing influences, the points would have fallen on
the line. We assume, in other words, that the line shows the real
relationship between the variables after we have eliminated the
effects of disturbing forces. Such a line we call a regression line
for reasons which will become apparent later. In cases where our
independent variable is time (in those cases, that is, where we
show how some variable changes with the passage of time), we
call the line a trend line. Thus Fig. 11.3 shows the United States
production of petroleum from 1917 to 1929 with a trend line.
This trend line, like a regression line, shows what we think would
have happened in the absence of disturbing forces. The production
of petroleum actually did fluctuate from year to year; but if
there had been no inaccuracies of measurement, and if there had
been no minor temporary disturbing forces, we think that the
long-run forces would have yielded productions right along the
straight line. We think of the values along the line as “normal”
productions, and the deviations above and below the line as
abnormalities.

Regression lines and trend lines need not be straight lines, of
course. They may contain curves and bends and wiggles. Yet
we do assume that the basic underlying pattern is a fairly simple
and regular one. The scientist tries always to simplify: to eliminate
the confusion of minor variation, and to discover the simplicity
which he thinks rules if only he can discover it. In the
remainder of this chapter we shall describe further the methods
of discovering and evaluating linear or straight-line trends, and
in the following chapter we shall consider the fitting of various
sorts of simple curves.

11.6. The Freehand Linear Trend.—Let us turn back now from
Fig. 11.2, where the position of the regression line is almost immediately
evident, to Fig. 11.1, where the scatter obscures the
underlying relationship and one sees only with difficulty that
there is any relationship present, and try to find a straight
regression line to depict the relationship between age and blood
pressure. We may start by stretching a thread over the points
on the diagram and moving it about until we think that we have
found the best possible position; or we may use a transparent
ruler for the same purpose. We note that the points toward the
left-hand side of the diagram tend to be concentrated toward the
lower part, while at the right-hand side they are scattered more
widely, but, on the average, lie farther up on the vertical scale.
We conclude, then, that our line should slope up toward the
right; that is, whatever relationship there is is a positive one, and
greater ages and higher blood pressures tend to be associated.
We may be content to draw our line entirely by eye, relying on
our judgment to get the proper height and slope; or we may, on
the other hand, follow in some degree the system used in making
Fig. 11.2. We have not enough cases to compute nine or ten
averages, as was done in that case. (In fact, we do not really
have enough cases to form any firm opinions anyway. In practice
the statistician would require many more than the 21 cases
which we are using.) But even in our case, where we have used
a small number of cases in an effort to get a simple example of our
method, we could separate the 10 youngest persons and the 10
oldest persons, and for each group compute the average age and
the average blood pressure. If we do this we find that the 10
youngest people in Table 11.5 have an average age of 18.6 years
and an average blood pressure of 122.2 mm. The 10 oldest
people have an average age of 50.5 years and an average blood
pressure of 132.0 mm. We can, if we wish, add these two points
to our scatter diagram and draw our straight regression line
through them. This has been done in Fig. 11.4, where the
regression line appears added to the points of Fig. 11.1.
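
The separation into the 10 youngest and the 10 oldest persons can be sketched as follows. The pairs are transcribed from Table 11.5, with unclear digits resolved so as to agree with the ages given in the text; this transcription is an assumption, and a single misread pressure would shift a group average, so the figures printed in the text remain the authority. The function name `group_point` is our own.

```python
from statistics import mean

# (age in years, systolic pressure in mm) as read from Table 11.5.
pairs = [
    (12, 108), (12, 120), (12, 128), (13, 110), (17, 124), (18, 110), (24, 130),
    (25, 130), (26, 124), (27, 120), (28, 130), (33, 140), (35, 130), (35, 140),
    (44, 120), (46, 126), (55, 130), (57, 120), (60, 114), (62, 140), (78, 160),
]

def group_point(group):
    """Average age and average pressure of a group: one point for the regression line."""
    return mean(x for x, _ in group), mean(y for _, y in group)

ordered = sorted(pairs)                          # youngest first
youngest, oldest = ordered[:10], ordered[-10:]   # the middle (11th) person is left out

for label, group in (("10 youngest", youngest), ("10 oldest", oldest)):
    x_bar, y_bar = group_point(group)
    print(f"{label}: average age {x_bar:.1f}, average pressure {y_bar:.1f}")
```

With 21 persons the middle one belongs to neither group; the two averages obtained are simply two convenient points through which to rule the line.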

11.7. The Method of Selected Points.—The student of elementary
algebra will remember that any straight line can be
described by a simple equation of the general form Y = a + bX,
where Y and X are values of our dependent and independent variables,
and a and b are two numerical constants which we must
determine. In the method of selected points we determine these
values from points which we select on the regression line. We
select two points, usually one near each end of the line, and read
from the diagram their X and Y values. For example, in Fig.
11.4 we might note that the regression line has a height of about
138 at the age of 70. This means that it has a Y value of 138
when the X value is 70. At the left-hand side of the chart we
note that the line has a height of about 120 at the age of 10.
Thus another pair of values, read from the line, is X = 10 and
Y = 120. (These values could be read more accurately from a
piece of cross-section paper than they can be from Fig. 11.4, but
even that chart will serve for purposes of illustration.)

We now substitute these values of X and Y in our type equation 
for the straight line, Y = a + bX. Starting first with the values 



Fia. 11.4. Ages and blood pressures of 21 persons, with freehand regression 
line. 

Y = 138 and X — 70, and substituting these values for Y and X 
in the type equation, we get 

138 = o + 70b 

This is our first observation equation. We then use the other pair 
of values which we read from the diagram, Y = 120 and X = 10. 
Substituting these in our type equation we get 

120 - a + 10b 

We now have two observation equations with two unknowns. If 
we solve them simultaneously we find that a = 117 and b = 0.3. 
This tells us that the general type equation for the straight line, 
Y = a + bX, becomes, for this particular problem, 

Y = 117 + 0.3X 

Every equation of the general form Y = a + bX, with any values 
whatever for a and b, will be the equation of some straight line; 
but the particular equation, Y = 117 + 0.3X, is the equation 
of that specific straight line which appears in Fig. 11.4. It is 
the straight line which passes through the two points which we 
selected, and that is why we call the process the method of selected 
points. 
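
The solution of the two observation equations can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name line_through is our own invention and is not part of the method as stated.

```python
def line_through(p1, p2):
    """Return (a, b) of Y = a + bX through two selected points (X, Y)."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two selected points need different X values")
    # Subtracting one observation equation from the other eliminates a.
    b = (y2 - y1) / (x2 - x1)
    # Substituting b back into either observation equation recovers a.
    a = y1 - b * x1
    return a, b


# Values read roughly from Fig. 11.4, as (age, blood pressure).
a, b = line_through((10, 120), (70, 138))
print(f"Y = {a:.0f} + {b:.1f}X")   # Y = 117 + 0.3X
```

Any two distinct points on the line would give the same pair of constants, apart from errors in reading the diagram.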

If we have drawn a line entirely freehand, we read any two 
points from it, but in practice select points close to the ends in 
order to minimize any slight error which might occur in reading 
the diagram. To be sure, we do not read the diagram more accu¬ 
rately near the extremes than elsewhere, but if we make minor 
errors in reading near the center, and then project the line to the 
ends of the diagram, we magnify the errors in the projection. In 
our case, however, we did not finally draw the line entirely free¬ 
hand. We drew it through two points representing the average 
ages and blood pressures of the 10 youngest and the 10 oldest 
persons in the distribution. Since we drew the line so that it 
passes through these two points, it is logical for us to use these 
two same points as our “selected points” in computing the equa¬ 
tion of our line. Reference to Sec. 11.6 will show that the two 
points in question had the values X = 18.6 and Y = 122.2; and 
X = 50.5 and Y = 132.0. If we substitute these values in the 
type equation of the straight line, we get the two observation 
equations: 

122.2 = a + 18.6b 

132.0 = a + 50.5b 

Solving these two equations for a and b we get 

a = 116.5 and b = 0.31 

The equation of the line which passes through the two points in 
question is, then, 

Y = 116.5 + 0.31X 

This is very nearly the same as the equation 
Y = 117 + 0.3X 

which we found when we read values roughly off the diagram of 
Fig. 11.4; but the equation which we have just found should be 
slightly preferable to our earlier one, since the line of Fig. 11.4 
was drawn to pass through the points of average age and blood 
pressure, and our more recent regression equation does pass 
through these two points. They are the two points which we 
selected this time for purposes of computation. Let us say, then, 
that the straight regression line which appears in Fig. 11.4 has 
the formula Y = 116.5 + 0.31X. We shall interpret the mean¬ 
ing of this equation shortly, but first we shall turn to alternative 
methods of finding such equations. 
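
The advice to choose selected points near the ends of the line can be checked numerically. The sketch below first reproduces the equation through the two group averages, and then supposes (purely for illustration) that the true line is Y = 116.5 + 0.31X and that one of the two readings is 1 mm too high. The function names and the 1 mm misreading are assumptions of ours.

```python
def line_through(p1, p2):
    """Return (a, b) of Y = a + bX through two points (X, Y)."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)
    return y1 - b * x1, b


# The two group averages of Sec. 11.6 used as selected points.
a, b = line_through((18.6, 122.2), (50.5, 132.0))
print(f"Y = {a:.1f} + {b:.2f}X")   # Y = 116.5 + 0.31X


def projected_error(x1, x2, x_target, misread=1.0, true_a=116.5, true_b=0.31):
    """Error in the fitted height at x_target when the reading at x2 is off."""
    y1 = true_a + true_b * x1
    y2 = true_a + true_b * x2 + misread      # hypothetical reading error
    fit_a, fit_b = line_through((x1, y1), (x2, y2))
    return (fit_a + fit_b * x_target) - (true_a + true_b * x_target)


wide = projected_error(10, 70, x_target=78)     # points near the ends
narrow = projected_error(35, 45, x_target=78)   # points near the center
print(f"error at age 78: wide {wide:.2f} mm, narrow {narrow:.2f} mm")
```

The same misreading does several times as much damage at the end of the chart when the two points are taken close together near the center.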

11.8. The Method of Least Squares. —The freehand trend, 
with its equation determined by the method of selected points, 
has the advantage that it can be found quickly and easily with a 
minimum of arithmetic. Yet the location of the line is deter¬ 
mined subjectively, merely by selecting the line that “looks best.” 
We realize that two equally competent statisticians working with 
the same data might not draw exactly the same line. This raises 
the question as to whether or not some one line is better than 
another—whether or not we can choose some one “best” line to 
describe the relationship which we are studying. 

When the problem is stated in this way, we realize at once that 
there is a “best” line. It is the line which would show actually 
the most probable blood pressure at each age, eliminating all the 
variations from errors of measurement or from other variables. 
Any regression line should show us the real underlying basic rela¬ 
tionship between X and Y —between the two variables studied. 
In the case of trend lines we should show the line along which the 
values of Y would actually have moved if they had been subject 
to long-run forces only—if all temporary and random forces had 
been eliminated. Our search is not for just any straight line, but 
for the particular straight line that really shows the underlying 
pattern. 

But unless one has faith in the crystal ball or the Ouija board, 
he can never know what would have been true if some forces had 
been different. We are therefore forced to guess what would 
have happened. 1 Yet some guesses are better than others. If 
I am asked to guess the height of someone, knowing nothing 
save that he is a Harvard student, I can make a better guess on 
the basis of the facts in Table 5.1, page 83, or on the basis of 
the statistical summary of these facts on page 210, than I can 
unassisted. If you are to toss a penny fifty times, and I am 
asked to guess how often it will fall with “heads” uppermost, I 
am wiser to base my guess on reasoning than I am to select a 
number at random. And in selecting trend lines it is also true 
that some guesses are better than others. An infinite number 
of straight lines can be drawn upon a chart, and all of them may 
slant approximately in the direction of the secular movement, 
but just as 25 “heads” are more likely in 50 throws of a coin 
than 28 or 30 heads, so one of these straight lines is more likely 
to be correct than any of the others. 

1 Some people may prefer to dignify the processes involved here by calling 
them “estimates” rather than “guesses.” The name is really not important 
if the student understands that the process is one based on reasoning. 
In practice it seems to be true that students more often put too much faith 
in the results of least squares than too little. They think that somehow the 
mathematical processes of the least-squares method give them an answer 
that is “correct,” rather than an estimate or guess of what is correct. It is 
to offset this tendency toward blind and innocent acceptance that I prefer 
to speak of the processes involved as guesswork rather than as estimation. 

Fig. 11.5. Hypothetical output data from Table 10.10. 
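
The statement about the penny can be verified with the binomial formula. The sketch below assumes a perfectly fair coin and independent tosses; the function name prob_heads is ours.

```python
from math import comb


def prob_heads(k, n):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n


# Compare the chances of 25, 28, and 30 heads in 50 tosses.
for k in (25, 28, 30):
    print(k, round(prob_heads(k, 50), 4))

# The single most likely number of heads in 50 tosses.
most_likely = max(range(51), key=lambda k: prob_heads(k, 50))
print("most likely:", most_likely)
```

Even the most likely outcome is far from certain, which is the sense in which the best guess remains a guess.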

Let us look at the chart in Fig. 11.5. Suppose we wish to 
represent the long-run movement of this chart by a straight line, 
as seems reasonable from casual inspection of the data. It is 
immediately apparent that no straight line will describe what 
happened in the sense that it will pass through the various points 
on the diagram. The points do not lie along any straight line. 
Therefore any straight line which we draw will have errors. We 
might draw a line so high on the diagram that all the actual 
points would lie below it; on the other hand, we might place our 
line so low that all the actual points would lie above it. In practice 
we would be more likely, however, to draw a line something 
like the one in Fig. 11.6, such that some of the actual points lie 
above and some below the line. In this case we can show our 
errors by means of the light dotted vertical lines connecting each 
of the original points with the straight trend line. 

Fig. 11.6. Data of Table 10.10 with freehand straight-line trend. 

If we assume, now, that our trend is to show the ordinary or 
expected course of events, and that the variations around the 
trend are due to more or less chance occurrences, we can determine 
the chances that any particular set of errors or deviations 
would occur. Under these assumptions it can be shown that that 
trend line is most likely from which the sum of the squared devia¬ 
tions is a minimum. Perhaps we can illustrate the meaning of 
these terms best by using the data of Fig. 11.6 as an example. 
The original data of this figure are shown in the first two columns 
of Table 11.6. The third column shows for each year the height 
of the straight line trend which appears on the chart. The fourth 
column shows the amount of the error or the residual, found by 
subtracting the trend value from the actual value. The last 
column shows the squares of these residuals, and at the bottom 
of this last column is the sum of the squared residuals, or the 
sum of the squared errors. In this particular case the sum 
amounts to 6.25. 

Table 11.6. Illustration of Squared Errors from Data of Fig. 11.6 

Year    Output    Trend    Error    Squared 
        (000)     Value             Error 
1932      1        1.0      0.0      0.00 
1933      0        1.5     -1.5      2.25 
1934      2        2.0      0.0      0.00 
1935      3        2.5      0.5      0.25 
1936      3        3.0      0.0      0.00 
1937      4        3.5      0.5      0.25 
1938      3        4.0     -1.0      1.00 
1939      5        4.5      0.5      0.25 
1940      5        5.0      0.0      0.00 
1941      7        5.5      1.5      2.25 
1942      6        6.0      0.0      0.00 
Total                                6.25 

If we were to draw other straight lines on the chart, we should 
get other sets of errors, other squared errors, and different sums 
of the squared errors. For example, the student might try using 
as trend values the numbers 0.5, 1.1, 1.7, 2.3, 2.9, 3.5, 4.1, 4.7, 
5.3, 5.9, and 6.5. The difference between each successive pair of 
numbers in this series is 0.6, and if they are plotted on Fig. 11.6, 
they will give a straight line which is slightly steeper than the 
straight line already shown in that figure and crosses it at the 
middle of the chart. If the student will make up a table similar 
to Table 11.6, using these new trend values, computing his new 
errors and new squared errors, and adding his column of squared 
errors, he will find a sum of 5.15 instead of the sum 6.25 discovered 
in Table 11.6. The fact that the sum of the squared errors is 
smaller in the new case than in the old one means that the new 
line is, on the basis of our assumptions, more likely to be right 
than the old line. It does not show that it is the correct line, 
and we never do know what is the correct line. But the smaller 
the sum of the squared errors, the more likely the line is to be 
correct. 1 
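
The comparison of the two trend lines can be repeated mechanically. The following sketch uses the output figures of Table 11.6; the function name sum_squared_errors is ours.

```python
# Output (000) for 1932 through 1942, from Table 11.6.
output = [1, 0, 2, 3, 3, 4, 3, 5, 5, 7, 6]


def sum_squared_errors(actual, trend):
    """Sum of squared residuals, each residual being actual minus trend."""
    return sum((y - t) ** 2 for y, t in zip(actual, trend))


freehand = [1.0 + 0.5 * i for i in range(11)]   # trend column of Table 11.6
steeper = [0.5 + 0.6 * i for i in range(11)]    # 0.5, 1.1, ..., 6.5

print(round(sum_squared_errors(output, freehand), 2))   # 6.25
print(round(sum_squared_errors(output, steeper), 2))    # 5.15
```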

1 For a simple proof of this statement, under our assumptions that the 
errors are pure chance affairs, see F. L. Griffin, “Introduction to Mathematical 
Analysis,” pp. 456-457, Houghton Mifflin Company, Boston, 1921. 

If we wanted to find the one straight line which fitted the data 
best of all—which gave a smaller sum of squared errors than any 
other straight line—it would evidently take too long to go at it 
by trial and error. We cannot try 20 or 60 or 150 different lines, 
in each case computing the sum of the squared errors, and finally 
select the line with the smallest sum of the squared errors. This 
would take too long. But fortunately we can find the line we 
want by a very simple method. This method is called the method 
of least squares because it gives us the one line from which the 
sum of the squares of the errors is the smallest possible for any 
line of the type being fitted. We shall see how to find the “best 
fitting” straight line, the “best fitting” second-degree parabola, 
and the “best fitting” reciprocal curve by the method of least 
squares. The student must remember that the least-squares 
line is not necessarily the best one. In the first place, the straight 
line fitted by least squares is merely more likely to be right than 
any other straight line. Perhaps the basic trend was not a straight 
line at all. In that case the straight line fitted by least squares 
will not be the correct line. Similarly a second-degree parabola 
fitted by least squares is more likely to be correct than any other 
second-degree parabola. We can generalize by saying that when 
we fit any line or curve by the method of least squares we get the 
line or curve that is more likely to be correct than any other line 
of the particular “family” which we could fit. Even so, it is 
more likely to be correct than other lines of the same family only 
if we are correct in our assumption that the errors or the residuals 
around the line are the results of random chance forces. In such 
a case the residuals will tend to be normally distributed, most of 
them clustered close to the line, with points getting less and less 
common as we get farther and farther from the line, and with 
points above and below the line being approximately evenly 
balanced. 
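
What trial and error would involve can be seen from a small sketch, which a machine can carry out even though it would be tedious by hand. It tries a coarse grid of candidate lines for the output data of Table 11.6 and keeps the one with the smallest sum of squared errors. The grid limits, the step of 0.05, and the choice of counting X in years from 1932 are all assumptions made for illustration.

```python
years = list(range(1932, 1943))
output = [1, 0, 2, 3, 3, 4, 3, 5, 5, 7, 6]    # Table 11.6
x = [year - 1932 for year in years]           # assumption: X counted from 1932


def sse(a, b):
    """Sum of squared errors around the line Y = a + bX."""
    return sum((y - (a + b * xi)) ** 2 for xi, y in zip(x, output))


# Candidate lines: a from 0.00 to 1.00 and b from 0.40 to 0.80, in steps of 0.05.
candidates = [(i / 20, j / 20) for i in range(21) for j in range(8, 17)]
best_a, best_b = min(candidates, key=lambda ab: sse(*ab))

print(len(candidates), "lines tried")
print("best on the grid:", best_a, best_b, round(sse(best_a, best_b), 4))
```

The search can only return the best line among those tried. A finer grid would come closer to the true minimum, at the cost of many more trials.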

11.9. Fitting a Straight Line by Least Squares.—We learned 
in Sec. 11.7 that every straight line will have the general form 

Y = a + bX 

We wish to find the values of a and b that will give the one straight 
line which fits best (the “best fit” being defined as that which 
minimizes the sum of the squared residuals or deviations). It is 
easy to show that these values of a and b can be determined from 
the following two normal equations: 1 

Na + bΣX = ΣY 
aΣX + bΣX² = ΣXY 

In order to solve these two equations we need the following 
values: 

N, ΣX, ΣY, ΣX², ΣXY 

1 Each year (or other period of time) is designated by X. 

Each value of the other variable (as petroleum production) is designated 
by Y. 

The equation of the straight line at any point X is Y = a + bX. 

If the point does not fall on the line, the distance from the point to the line 
(that is, the deviation) is represented by d. Thus 

d = a + bX - Y 
d² = (a + bX - Y)² 

This is true for each deviation. If we sum all such terms, to get the sum of 
the squared deviations, we have 

Σd² = Σ(a + bX - Y)² 

This is the value we wish to minimize. Let us represent it by f. We minimize 
by setting the partial derivatives with respect to a and b equal to 
zero. That is, 

∂f/∂a = 2Σ(a + bX - Y) = 0 

∂f/∂b = 2Σ(a + bX - Y)X = 0 

Dividing by 2 and then summing as directed, we get 

Σa + bΣX - ΣY = 0 
aΣX + bΣX² - ΣXY = 0 

Since Σa (when a is a constant) = Na, we have, by transposing the last 
terms, 

Na + bΣX = ΣY 
aΣX + bΣX² = ΣXY 

If we solve these equations for a and b after substituting the proper values 
of N, ΣX, ΣY, ΣX², and ΣXY, we get the values of a and b which minimize 
Σd². These are what we want. 

We shall start by fitting the least-squares straight line to the data 
of Table 11.5. The computations appear in Table 11.7. The 
first two columns of numbers are repeated from Table 11.5. 
They show our original values of the two variables. The last 
two columns are derived from them. As is always done conventionally, 
we symbolize the independent variable by X and the 
dependent variable by Y. We compute a column of X²'s and a 
column of XY's, and find the appropriate totals. These are the 
values which we need to substitute in the normal equations to get 


Table 11.7. Computation of Least-squares Straight Regression Line 
between Age and Blood Pressure 

Name          Age (years)    Blood Pressure (millimeters)     (X²)       (XY) 
                  (X)                    (Y) 
Ruby               12                    108                   144      1,296 
Betsey             12                    120                   144      1,440 
Muriel             12                    128                   144      1,536 
Mary               13                    116                   169      1,508 
Alice              17                    124                   289      2,108 
Frank              18                    116                   324      2,088 
Walter             24                    130                   576      3,120 
Dorothy            25                    130                   625      3,250 
Frederick          26                    124                   676      3,224 
Esther             27                    126                   729      3,402 
Sidney             28                    130                   784      3,640 
Albert             33                    140                 1,089      4,620 
Edith              35                    130                 1,225      4,550 
John               35                    140                 1,225      4,900 
Robert             44                    120                 1,936      5,280 
Dan                46                    126                 2,116      5,746 
Priscilla          55                    130                 3,025      7,150 
Donald             57                    120                 3,249      6,840 
James              60                    114                 3,600      6,840 
Peter              62                    140                 3,844      8,680 
Ralph              78                    160                 6,084     12,480 
Totals            719                  2,672                31,997     93,698 

21a + 719b = 2672 
719a + 31,997b = 93,698 

Solving these two simultaneous equations we discover the values 
of a and b as follows: 

a = 117 and b = 0.3 

The equation for our straight line is, then 

Y = 117 + 0.3X 

This is not the equation for just any straight line, but is the equation 
for the particular straight line from which the sum of the 
squared residuals is a minimum. It is the least-squares straight 
line. If the errors around this line are normally distributed, this 
line is more likely to be “right” than any other straight line. 
Possibly we should emphasize the limitations and qualifications 
of this statement again. The line fitted by the method of least 
squares is not necessarily the “best” line. We must qualify our 
statement to include: 

1. Perhaps the underlying relationship is not a straight line 
relationship at all. This line is not the best line, but the best 
straight line that can be fitted. 

2. And it is not necessarily even the best straight line. It 
merely has a better chance of being right than any other straight 
line. If we throw a penny 100 times, we have a better chance of 
getting 50 heads than any other number—but we may not really 
get 50 heads. This line has a better chance of being the right 
line than any other straight line, but it may not be the right one 
even if the right line is straight. 

3. And if the errors or residuals are not normally distributed, 
then there are no advantages of the least-squares line over other 
straight lines. And since we cannot know until after we fit it 
whether the errors are normally distributed or not, at the time we 
decide to fit the line by the method of least squares we cannot 
know that it has any inherent advantage over a straight line care¬ 
fully fitted by eye without mathematical computation. 

The great French mathematician Gabriel Lippman once said 
that everyone likes the method of least squares: the theoretical 
mathematician because he believes that practicing statisticians 
have found empirically that errors are normally distributed and 
hence that there is theoretical basis for choosing the method, and 
the practicing statistician because he believes that the theoretical 
mathematician has proved that it is the right and logical method 
to use. At any rate the student would do well to realize that 
the adoption of a mathematical method such as the method of 
least squares can never protect him against errors of judgment. 
It can never make a relationship straight which was curved before. 
It can never make a distribution of errors normal which was 
skewed before. The real reasons for employing the method of 
least squares are that it is simple, it is objective, and it may turn 
out when one is all done to have advantages in probability (if the 
errors turn out to be normally distributed). For a large propor¬ 
tion of actual cases the simple freehand method is as logical and as 
acceptable except with those people who imagine that arith¬ 
metical calculation sanctifies scientific endeavor, and is an end in 
itself. 

While we are at it, let us fit by least squares a straight line to 
the data of Fig. 11.2, based on over 50,000 cases. We shall fit 
the line to the 10 averages shown in the diagram rather than to 
the actual original 50,000 cases. Taking our figures from Table 
11.1 on page 295, and using the class marks as our values of X, 
we get Table 11.8. 

Table 11.8. Computation of Straight Regression Line by Least 
Squares; Data of Table 11.1 

Age (years)    Blood Pressure (millimeters)     (X²)       (XY) 
    (X)                    (Y) 
     17                    120                   289      2,040 
     22                    122                   484      2,684 
     27                    123                   729      3,321 
     32                    124                 1,024      3,968 
     37                    126                 1,369      4,662 
     42                    128                 1,764      5,376 
     47                    130                 2,209      6,110 
     52                    132                 2,704      6,864 
     57                    134                 3,249      7,638 
     62                    135                 3,844      8,370 
Sums    395              1,274                17,665     51,033 

The last two columns are derived from these 
original data, and the totals from Table 11.8 are substituted in 
the normal equations to give 

10a + 395b = 1274 
395a + 17,665b = 51,033 

Solving these equations we find that 

a = 113.8 and b = 0.344 

The equation for our line is, therefore, 

Y = 113.8 + 0.344X 

We can now summarize several attempts which we have made to 
find the formula of a straight line representing the relationship 
between age and blood pressure. 

1. Based on 21 cases of Fig. 11.4 and a freehand trend: 

Y = 117 + 0.3X 

2. Based on 21 cases, with regression line drawn through two group 
averages: 

Y = 116.5 + 0.31X 

3. Based on 21 cases, fitted by least squares: 

Y = 117 + 0.3X 

4. Based on 50,000 cases of Table 11.1, fitted by least squares: 

Y = 113.8 + 0.344X 
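
The two least-squares equations in this summary can be reproduced by solving the normal equations directly. The sketch below is one way of doing so; the function name solve_normal_equations is ours. For the 10 group averages the sums are built from the rows of Table 11.8. For the 21 cases the totals are taken as printed in Table 11.7 and are not recomputed from the rows.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve n*a + b*sum_x = sum_y and a*sum_x + b*sum_x2 = sum_xy for (a, b)."""
    determinant = n * sum_x2 - sum_x ** 2
    if determinant == 0:
        raise ValueError("all X values are equal; no unique line exists")
    b = (n * sum_xy - sum_x * sum_y) / determinant
    a = (sum_y - b * sum_x) / n
    return a, b


# Table 11.8: class marks (X) and average blood pressures (Y).
ages = [17, 22, 27, 32, 37, 42, 47, 52, 57, 62]
pressures = [120, 122, 123, 124, 126, 128, 130, 132, 134, 135]

sums = (
    len(ages),
    sum(ages),
    sum(pressures),
    sum(x * x for x in ages),
    sum(x * y for x, y in zip(ages, pressures)),
)
print(sums)                             # (10, 395, 1274, 17665, 51033)

a, b = solve_normal_equations(*sums)
print(f"Y = {a:.1f} + {b:.3f}X")        # Y = 113.8 + 0.344X

# Totals as printed in Table 11.7 (21 persons).
a21, b21 = solve_normal_equations(21, 719, 2672, 31997, 93698)
print(f"Y = {a21:.0f} + {b21:.1f}X")    # Y = 117 + 0.3X
```

Eliminating a between the two normal equations gives b first, and a then follows from the first equation.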


11.10. Interpreting Results of Regression Formulas.—The 
results summarized at the end of the preceding paragraph are 
naturally very similar, since they describe approximately the 
same thing. Each equation is an attempt to describe the rela¬ 
tionship between age and blood pressure. The last two may have 
some advantage over the first two in the fact that they were com¬ 
puted by the method of least squares, and therefore if the resid¬ 
uals are normally distributed they have a greater chance of being 
right than other straight lines. The last equation, as contrasted 
with the others, is based on some 50,000 cases. We would be 
foolish to draw conclusions about anything as complicated as this 
relationship on the basis of 21 observations, and in this regard our 
first three equations are immediately suspect. Twenty-one 
cases may serve as a basis for describing the mechanics of compu¬ 
tation, but they are entirely inadequate as a basis for forming 
reliable conclusions. Let us agree at this point, then, to accept 
the fourth and last of the equations for the time being as our best 
description of the relationship, and find out what it means. The 
equation tells us that 

Y = 113.8 + 0.344X 

But Y is blood pressure in millimeters and X is age in years. If 
we select arbitrarily any age (that is, any value of our independent 
variable, X) we can substitute it in the equation and find the 
corresponding value of Y. For example, at the age of 17 years 
we have 

Y = 113.8 + (0.344)(17) = 119.6 

Our best estimate is that a 17-year-old will have a blood pressure 
of 119.6 mm. The figures of Table 11.1 on page 295 state that 
the average for those actually measured was 120, but we decide 
on the basis of our equation that “normal” for this age was 
really 119.6. Similarly for the age of 62 we find that 

Y = 113.8 + (0.344) (62) = 135.1 

Table 11.1 states that the average blood pressure for the 62-year- 
olds actually measured was 135 mm. We estimate from our 
equation that the “normal” blood pressure for that age was actu¬ 
ally 135.1. Note that our simple equation tells us substantially 
everything which we can get from the table—and much more 
besides. For example, if we want to know the “normal” blood 
pressure for a person 49 years old, we cannot find it in Table 11.1; 
but we can compute quickly from the formula that the blood 
pressure would be 130.7 mm. And of course, if we want to show 
our formula graphically we can always add it to a chart of the 
original data. After all, our equation is the equation of a straight 
line. If we can find any two points on the line, all we have to do 
is connect them with a ruler. But we have already computed 
three points on the line. We have found that at age 17 the line 
has a height of 119.6; at age 62 it has a height of 135.1; and at age 
49 it has a height of 130.7. We can plot points on our scatter 
diagram representing blood pressures of 119.6 at age 17 and 135.1 
at age 62 (thus selecting points at the extremes of our diagram) 
and connect them with a straight line as in Fig. 11.7. Now our 
diagram shows both the original points and the least-squares 
straight regression line which gives a summary description of 
them. The points do not lie exactly on the line, to be sure, and 
in a case like Fig. 11.4 on page 305 the scatter around the line is 
even greater. But we do not think of there being error in the line 
because it does not pass through the points. No straight line 
could pass through all the points. On the contrary, we believe 
that the error lies in the points—in their failure to fall along the 
line. We think that our line, derived though it is from the 
points, is “correct” and the points “wrong.” More accurately, 
we think that the individual points were subject to a great many 
errors of measurement and influenced by many forces other than 
the ages of the persons, and that as a result the individual points 
reflect many influences in addition to the relationship between 
age and blood pressure. Our line, we hope, has succeeded in 
eliminating all these minor confusing irregularities, and left us 
the relatively pure and simple relationship itself. We remind 
the student again that the statistician is not primarily interested 
in the particular data which he studies, but purposely tries to 
derive from the irregularities and imperfections of these data 
ideas about cases which have never been studied at all. Again we 
are shifting from the sample which we have studied to the universe 
which we have not studied (and which we usually cannot 
study). 

Fig. 11.7. Average blood pressure at various ages in 50,000 persons, with 
least-squares straight regression line. 
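
The substitutions made in this section are easily mechanized. The sketch below evaluates the fourth equation at the three ages discussed; the function name estimated_pressure is ours.

```python
A, B = 113.8, 0.344   # constants of the equation based on 50,000 cases


def estimated_pressure(age):
    """Height of the regression line, in millimeters, at a given age in years."""
    return A + B * age


for age in (17, 49, 62):
    print(age, round(estimated_pressure(age), 1))
# 17 119.6
# 49 130.7
# 62 135.1
```

Any two of these points are enough to draw the line on the scatter diagram.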

But we can be even more specific in the interpretation of our 
formula Y = 113.8 + 0.344X. Let us note that each time we 
add one more unit of X we must, according to our equation, add 
0.344 more units of Y. In our problem, every time we add a year 
of age, we add 0.344 mm. of blood pressure. To put this in other 
terms, there is a general average tendency for blood pressure to 
rise 0.344 mm. per year. In any linear regression equation the 
value of b in the formula Y = a + bX shows the number of units 
increase (or decrease) in the dependent variable which accompany 
an increase of one unit in the independent variable. This value 
is known as the regression coefficient, and is often a very important 
conclusion of scientific study. We can see just as easily that, 
if we let X assume the value of zero, then our equation becomes 

Y = a + 0b or Y = a 

In other words, the value of a in our equation of the general form 
Y = a + bX tells us the height of our line if we extend it back 
to the point where X is zero. Sometimes it makes sense to say 
that the value of a is the value of Y when X equals zero. Thus 
if we were finding the relationship between the size of the family 
on the one hand, and the annual expenditure for moving pictures 
on the other, we might let X represent the number of children 
and Y the annual expenditure for movies in dollars. Now the 
value of a in our regression equation would tell us the typical or 
average or normal or expected expenditure for families with no 
children. On the other hand, if we were finding the relationship 
between the heights and weights of 10-year-old boys it would not 
make much sense to say that the value of a was “the expected 
weight in pounds of boys with a height of zero!” In such cases 
we must recognize that the value of a merely tells us the height of 
our line if we extend it back (“extrapolate” it, as we say) to the 
point on our horizontal scale where X would be equal to zero. 
Really our value of a determines the height of the line, and the 
value of b determines the slope of the line. If the line is straight, 
we can tell all about it by knowing its height at any one point 
(such as at the point where X is zero) and the rate at which it rises 
or falls. If the value of b is positive, as in the case we have 
studied, it means that the line is rising as we move toward higher 
values of X. In our case the blood pressures increase on the 
average as age increases. The relationship is what we called a 
“positive” one. But if we were to study the relationship between 
the size of the potato crop and the price of potatoes we might 
anticipate that large crops would bring low prices. If our actual 
data reflected such a relationship, the value of b would turn out to 
be negative, and we would say that our relationship was an 
“inverse” or “negative” one. If our value of b turned out to be 
-0.62, it would mean that there was a general tendency for the 
dependent variable to diminish by 0.62 of a unit when the independent 
variable increased by one unit. Perhaps this would 
mean that every extra million bushels of potatoes tended to 
reduce the price by 62 cents, as it would if our original data had 
been in units of dollars and millions of bushels. 

There is a common popular saying that “a man's blood pressure 
tends to be 100 plus his age” (that is, a man tends to have a 
blood pressure of 136 at the age of 36). We see from our study 
that this rule of thumb comes very far from expressing the rela¬ 
tionship which actually exists. We would be a good deal closer 
if we were to say in round numbers that the average blood pres¬ 
sure is 114 plus one-third the age. This would correspond 
approximately to 113.8 plus 0.344 times the age. Our blood- 
pressure example furnishes us thus with a good illustration of the 
way in which scientific laws are discovered. One collects 
patiently a large mass of data, such as our 50,000 cases of blood 
pressure; he tries by various means to eliminate the vagaries of 
chance and the errors of measurement and the effects of other 
variables, and he arrives finally at a formal quantitative state¬ 
ment describing the fundamental relationship which had at first 
been obscured by its mass of detail. 
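
The meaning of a and b, and the comparison with the popular rule of thumb, can be illustrated numerically. In the sketch below the three function names are ours, and the ages chosen for comparison are arbitrary.

```python
def regression_estimate(age, a=113.8, b=0.344):
    """Blood pressure given by the least-squares line based on 50,000 cases."""
    return a + b * age


def rule_of_thumb(age):
    """The popular saying: 100 plus the age."""
    return 100 + age


def rounded_rule(age):
    """The round-number version: 114 plus one-third the age."""
    return 114 + age / 3


for age in (17, 36, 62):
    print(age,
          round(regression_estimate(age), 1),
          rule_of_thumb(age),
          round(rounded_rule(age), 1))

# b is the change in Y for one more unit of X, whatever the starting age.
print(round(regression_estimate(41) - regression_estimate(40), 3))   # 0.344

# a is the height of the line where X is zero.
print(regression_estimate(0))   # 113.8
```

At age 36 the saying gives 136, while the regression line gives about 126, and the disagreement widens with age.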

11.11. The Residuals or Errors.—The first columns of Table 
11.9 are taken directly from Table 11.1, showing average blood 
pressures at various ages. The third column is computed by our 
straight-line least-squares formula, Y = 113.8 + 0.344X. We 
can thus compare the original figures with those computed by the 
formula. We have used as ages the class marks of the groups 
given originally in Table 11.1. The fourth column shows us the 
differences between the actual original values and those com¬ 
puted by the formula. These are the “errors” or “residuals.” 
If we try to estimate our original figures from our formula, we 
shall be in error by these amounts. They correspond to the short 
vertical broken lines of Fig. 11.6 on page 309. The last column 
of Table 11.9 shows the squares of these errors. The sum of the 
errors themselves should be zero, but since we have rounded off 
our numbers we get a value of 0.2. The method of least squares 
always yields (barring inaccuracies in computation or from 
rounding) positive and negative errors which exactly balance, 
and the sum of the errors is zero. But we should turn our atten¬ 
tion particularly to the last column, with a total of 1.92. This is 
the sum of the squares of the errors, or the sum of the squares of 
the residuals. We have computed our line in such a way as to 
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yield the smallest sum in this column that we could get with any 
straight line which could be drawn on Fig. 11.7. Thousands of 
straight lines could be drawn across that diagram, differing more 
or less in slope and height. We could compute the squared 
residuals around any of these lines—the sums of the squares of 
the distances from the points to the line. But no matter how 
many such lines we drew, we would never get another with a sum 

Table 11.9.— Computation of Residuals around the Least Squares 
Regression Line 


              Average Blood Pressure
   Age            (millimeters)                   Squared
 (years)       Actual      Estimated     Error     Error

    17          120          119.6        0.4       0.16
    22          122          121.4        0.6       0.36
    27          123          123.1       -0.1       0.01
    32          124          124.8       -0.8       0.64
    37          126          126.5       -0.5       0.25
    42          128          128.2       -0.2       0.04
    47          130          130.0        0.0       0.00
    52          132          131.7        0.3       0.09
    57          134          133.4        0.6       0.36
    62          135          135.1       -0.1       0.01

  Totals                                  0.2       1.92


of squared errors so small as 1.92. That is precisely what we 
mean when we say that this line was drawn by the method of 
least squares. The sum of the squared errors has been mini¬ 
mized. And if the errors themselves in the next to the last 
column of Table 11.9 form a normal distribution, then the line 
which we have drawn in Fig. 11.7 has a greater probability of 
being the “correct” line than any other straight line which could 
be drawn. To be sure, it would be pretty difficult to tell whether 
these 10 values form a normal distribution or not; but if we had, 
as we usually would in an actual statistical investigation, hun¬ 
dreds or thousands of values, then we could tabulate them and 
subject them to the tests described in Chaps. VII and VIII to see 
whether or not they were exactly or approximately normal in 
their distribution. 
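
The minimizing property described above can be checked numerically. The following is only a sketch in Python, under the assumption that the line is fitted directly to the ten pairs of Table 11.9; the function names are illustrative, and because nothing is rounded here the total of squared errors need not agree exactly with the 1.92 of the table.

```python
# Ages (class marks) and actual average blood pressures from Table 11.9.
ages = [17, 22, 27, 32, 37, 42, 47, 52, 57, 62]
pressures = [120, 122, 123, 124, 126, 128, 130, 132, 134, 135]


def fit_line(xs, ys):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def sum_sq_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = fit_line(ages, pressures)
residuals = [y - (a + b * x) for x, y in zip(ages, pressures)]
best = sum_sq_errors(a, b, ages, pressures)

# Four rival lines, each slightly shifted or tilted away from the fitted one.
rivals = [(a + 0.5, b), (a - 0.5, b), (a, b + 0.01), (a, b - 0.01)]
rival_totals = [sum_sq_errors(ra, rb, ages, pressures) for ra, rb in rivals]

print(f"a = {a:.2f}, b = {b:.4f}")
print(f"sum of errors = {sum(residuals):.6f}")
print(f"sum of squared errors = {best:.3f}")
print("rival totals:", [round(t, 3) for t in rival_totals])
```

Every rival line tried in this way should show a larger total than the fitted line, while the unrounded errors themselves balance to zero.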

11.12. Short Cuts with Historical Data.—As we noted in Sec. 
11.5, it is not uncommon for us to measure the relationship 
between some other variable and time—to investigate the way in 
which the other variable changes with the passage of time. In 
such a case, when we are considering the long-run factors and 
long-range influences, we speak about the basic pattern of move¬ 
ment as a trend or secular trend. If this basic pattern seems to be 
linear, we may well wish to fit a straight line to the data, either 
by the freehand method and the method of selected points, or by 
the method of least squares. The problem is a perfectly straight¬ 
forward one, and we could easily proceed with the methods just 
described. But historical data often differ in one particular from 
other sorts of data, and the difference makes it possible to avail 
ourselves of short cuts which are significant. The difference is 
that while, when we measure the ages of persons and their blood 
pressures, the ages are irregularly spaced, when we measure time 
phenomena the events are usually spaced evenly. With blood 
pressures we may, as in Table 11.5, find three persons at age 12, 
one person at age 13, and no one at all at age 14. But when we 
study the growth in population, we usually have one value each 
year (or each decade) and only one. Similarly our data may be 
set up weekly or monthly or hourly, or there may be a reading 
every 10 seconds; but at any rate it is common for the readings to 
be regularly spaced, with one value of Y at each value of X, and 
no integral values of X skipped. 

To be more concrete, Table 11.10 gives values of petroleum 
production in the United States from 1917 to 1929. The independent 
variable is time: the years 1917, 1918, etc. The 
dependent variable Y is petroleum production in millions of 
barrels. We see that there is one value for each year. The Y 
values are spread evenly—one value each year and no years 
missing. This is a typical situation with historical data. 

We could, of course, use the values of the years just as they 
stand—1917, 1918, etc. But these are large numbers, and when 
we square them or multiply them by other numbers, we get mag¬ 
nitudes that are hard to handle. Consequently we follow a plan 
which reduces our arithmetic considerably. We shift the origin 
of our time series—the point of time from which we reckon our 
calendar, and figure forward and backward. After all, any 
calendar base is arbitrary, and different people use various bases. 
In our society, time is usually reckoned from approximately the 
time of the birth of Christ, which gives us large numbers to 
deal with. The year we call 1953 is the year 7461-7462 of the 
Byzantine era, the year 5713-5714 according to the Jewish calen¬ 
dar, the year 2706 since the founding of Rome, etc. The more 
recent the starting point which we select, the smaller are the numbers 


Table 11.10.—Computation of Straight-line Trend of Petroleum 
Production, 1917-1929 

             Petroleum
             Production
            (millions of    Year (origin
              barrels)         1923)
  Year          (Y)             (X)           XY         X²

  1917          335             -6          -2,010       36
  1918          356             -5          -1,780       25
  1919          378             -4          -1,512       16
  1920          443             -3          -1,329        9
  1921          472             -2            -944        4
  1922          558             -1            -558        1
  1923          732              0               0        0
  1924          714              1             714        1
  1925          764              2           1,528        4
  1926          771              3           2,313        9
  1927          901              4           3,604       16
  1928          901              5           4,505       25
  1929        1,007              6           6,042       36

 Totals       8,332              0          10,573      182


with which we have to deal. Hence it is common in statistical 
problems to select some recent time as the origin of our data, 
computing forward and back from it; and particularly we can 
reduce our computations if we select as our origin the time which 
is at the center of the time series being studied. Thus in Table 
11.10 the center year is 1923 according to our usual calendar. 
But if we wish to use this as our origin, we merely call it the year 
zero, and give the years before it negative values and those after 
it positive values, just as we might talk of years b.c. and years 
a.d. If we select 1923 as our origin, the year 1924 becomes the 
year 1, the year 1925 the year 2, the year 1922 the year —1, and 
the year 1920 the year —3, etc. In Table 11.10 we have used the 
new values for X, our independent variable. Throughout the 
remainder of our work we have then proceeded just as in any 
other case of least-squares lines, computing the values of N, ΣX, 
ΣX², ΣY, and Σ(XY). To be sure, since our X values are all small, 
our multiplications and squarings are easy. But in addition, we 
note at once that the positive and negative values of X exactly 
balance each other, so that the sum of the X's is zero. The sums 
which we need for substitution in our normal equations are 

N = 13    ΣX = 0    ΣY = 8,332    Σ(XY) = 10,573 
ΣX² = 182 

The student will recall that our normal equations are 

Na + bΣX = ΣY 
aΣX + bΣX² = Σ(XY) 

But since, with our new origin, ΣX is zero (and since it always will 
be zero if we take the origin at the center), we can rewrite our 
normal equations for this one situation as follows: 

Normal equations for least-square straight line when origin is at 
center of time period: 

Na = ΣY 
bΣX² = Σ(XY) 

For this one case obviously a = ΣY/N, or the average value of 
Y, and b = Σ(XY)/ΣX². In our problem, with the origin at the 
center, a = 8,332/13 = 641, and b = 10,573/182 = 58.1, so that 
the equation for our least-squares straight line is 

Y = a + bX or Y = 641 + 58.1X 
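
The short cut can be sketched in a few lines of Python. This is an illustration only: the variable names are chosen here and are not part of the text, and the last step simply cross-checks that the slope is the same when the unshifted calendar years are put through the full normal equations.

```python
years = list(range(1917, 1930))
production = [335, 356, 378, 443, 472, 558, 732, 714, 764, 771, 901, 901, 1007]

origin = years[len(years) // 2]           # the middle year, 1923
xs = [year - origin for year in years]    # -6, -5, ..., 5, 6

n = len(years)
sum_x = sum(xs)
sum_y = sum(production)
sum_xy = sum(x * y for x, y in zip(xs, production))
sum_xx = sum(x * x for x in xs)

# With the origin at the center, sum_x is zero, so the normal equations
# reduce to N*a = sum_y and b*sum_xx = sum_xy.
a = sum_y / n
b = sum_xy / sum_xx

# Cross-check: slope from the general formula using raw calendar years.
sum_t = sum(years)
sum_ty = sum(t * y for t, y in zip(years, production))
sum_tt = sum(t * t for t in years)
b_raw = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)

print(sum_x, sum_y, sum_xy, sum_xx)
print(round(a), round(b, 1), round(b_raw, 1))
```

Shifting the origin changes only the constant a; the slope b is unaffected, which is why the two slope figures agree.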

From this equation we can compute the value of the trend for 
any year. For example, the year 1928 is the year 5, and Y = 
641 + (58.1)(5) = 641 + 290.5 = 931.5. While the actual production 
in 1928 was 901 million barrels, we estimate that if there 
had not been errors in measurement, and if there had not been 
temporary disrupting and disturbing influences, the production in 
1928 would have been 931.5 million barrels. Since our original 
figures were in years and in millions of barrels, our trend equation 
is in years and in millions of barrels. The value of b is 58.1. 
Since it is positive we know that petroleum production tended to 
increase, and the figure tells us that, while some years it increased 
more rapidly and some years less rapidly and some years not at 
all, the general tendency throughout the period was for production 
to increase at the rate of 58.1 million barrels per year. And 
when X is equal to zero (which means at the time of origin, or in 
this problem in 1923), the trend value was the value of a, or 641 
million barrels. The shift of origin to 1923 has eliminated much 
of our arithmetic, and made it possible for us to deal with small 
numbers. So that others can know how to interpret the data, we 
should state our conclusions thus: 

Y = 641 + 58.1X 
Origin 1923 

If we were to compute the trend value for 1920, we would note 
that in that year X has a value of —3, since 1920 is 3 years before 
the origin, and we would write Y = 641 + (58.1)(-3) = 641 - 
174.3 = 466.7. The actual petroleum production in 1920 was 
443 million barrels, but our equation tells us that the trend value 
(what we might think of as a “normal” production for 1920) was 
466.7 million barrels. 
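
As a small sketch of how the stated equation and its origin are used together, the Python function below converts a calendar year to X and returns the trend value. The function name and constants are illustrative; the coefficients are the rounded ones quoted in the text.

```python
A = 641.0       # trend value at the origin, millions of barrels
B = 58.1        # trend increase per year, millions of barrels
ORIGIN = 1923   # the year for which X = 0


def trend_value(year):
    """Trend value (millions of barrels) for a calendar year."""
    x = year - ORIGIN   # years after (+) or before (-) the origin
    return A + B * x


for year, actual in [(1928, 901), (1920, 443)]:
    print(year, "actual:", actual, "trend:", round(trend_value(year), 1))
```

Without the origin the equation could not be evaluated at all, which is why the origin is always stated with it.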

And this is perhaps as good a place as any to warn the student 
against too confident use of such trend equations or regression 
equations. Suppose we were to ask what you would estimate as 
the most likely or “normal” production for the year 1910. That 
is the year -13 on our basis; so we write 

Y = 641 + (58.1)(-13) = 641 - 755.3 = -114.3 

We reach the preposterous conclusion that the typical production 
for 1910 was negative: that far from taking petroleum out of the 
earth, men must have been pumping it in at the rate of 114.3 
million barrels a year! When a statistician gets a foolish or 
impossible answer, it pays him to question his method, and in 
this case we need to warn the student against the practice of 
extrapolation. Just as interpolation is the finding of a value that 
lies between values we already know, so extrapolation is the pro¬ 
jecting of values beyond the limits of our knowledge. The good 
statistician is very cautious about extrapolation, which can very 
easily lead him far astray. The reason can easily be seen by 
looking at Fig. 11.8, which shows actual figures for United States 
petroleum production from 1906 to 1929, with the straight line 
trend which we just computed by the method of least squares 
projected backward until it “strikes bottom” about 1912. It is 
immediately evident that while the trend gives a fairly good 
approximation of what happened within the period for which it 
was fitted, it gives a very poor picture indeed of what happened 
in the years preceding. The trend for 1917-1929 appeared when 
taken by itself to be linear, but we now see that it was really but a 
short section of a curvilinear trend. Often a very short section of 
a curve will appear to be a straight line. If we stick within that 
very short section, we get relatively little error by using a straight 
line; but when we try to extrapolate into other sections which we 



Fig. 11.8. United States petroleum production, 1906-1929, with least- 
squares linear trend fitted to the years 1917-1929. 

have not studied, we run a serious risk of error. This is just as 
important with regression lines as with trends. Thus we would 
not be justified on the basis of statistical information alone in 
assuming that children of six months or a year of age have blood 
pressures which follow the same rules as those which we com¬ 
puted for people of the age of 17 and above. 
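
One way to build this caution into a computation is to have the trend function refuse years outside the fitted period unless the user asks for extrapolation explicitly. The sketch below is a hypothetical Python illustration; the names and the `allow_extrapolation` flag are assumptions made here, not anything prescribed by the text.

```python
A, B, ORIGIN = 641.0, 58.1, 1923
FITTED_YEARS = range(1917, 1930)   # the trend was fitted to 1917-1929 only


def trend_value(year, allow_extrapolation=False):
    """Trend value in millions of barrels; guards against extrapolation."""
    if year not in FITTED_YEARS and not allow_extrapolation:
        raise ValueError(
            f"{year} lies outside the fitted period "
            f"{FITTED_YEARS[0]}-{FITTED_YEARS[-1]}"
        )
    return A + B * (year - ORIGIN)


# Forcing the extrapolation reproduces the impossible negative figure.
forced = round(trend_value(1910, allow_extrapolation=True), 1)
print("forced extrapolation to 1910:", forced)

try:
    trend_value(1910)
except ValueError as err:
    print("refused:", err)
```

The guard does not make extrapolation valid or invalid; it only forces the statistician to decide deliberately rather than by accident.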

The method which we have just used for cutting down our 
arithmetic is based on the fact that if we take our time origin at 
the center of our period the positive and the negative values of X 
will cancel out, and the sum of the X's will become zero. But this 
also obviously depended on the fact that we had an odd number of 
years in Table 11.10, so that there was a year, 1923, exactly at 
the center. What would we do if we found, as we would expect to 
find half the time, that the number of periods was even? We 
would proceed about as before, taking the origin at the center, but 
now the origin would lie between two years. For example, if we 
drop the first year of Table 11.10 and use just the last 12 years, we 
get Table 11.11. The origin is now 1923½, halfway between 
1923 and 1924. But to get away from fractional years we com¬ 
pute our X values now in units of half years instead of in units of 
years. Just as we are at liberty to shift our time origin if we 
wish, so there is nothing which compels us to compute our time 
units in years. If computations based on half years save time, 
why not use them? In Table 11.11 we note that the year 1924 is 

Table 11.11.—Computation of Straight Least-squares Trend, Even 
Number of Years 


  Year       (Y)       (X)        (XY)       (X²)

  1918       356       -11       -3,916       121
  1919       378        -9       -3,402        81
  1920       443        -7       -3,101        49
  1921       472        -5       -2,360        25
  1922       558        -3       -1,674         9
  1923       732        -1         -732         1
  1924       714         1          714         1
  1925       764         3        2,292         9
  1926       771         5        3,855        25
  1927       901         7        6,307        49
  1928       901         9        8,109        81
  1929     1,007        11       11,077       121

  Sums     7,997         0       17,169       572


one half year after the origin, the year 1925 is 3 half years after 
the origin, the year 1921 is five half years before the origin, etc. 
Hence we get the numbers which appear as our values of X, and 
again the positive and negative values of X cancel out. The sum¬ 
mary values from our table can now be substituted to get the 
values of a and b thus: 

a = ΣY/N = 7,997/12 = 666.4 
b = Σ(XY)/Σ(X²) = 17,169/572 = 30.0 

Our equation is 

Y = 666.4 + 30.0X 

This appears to differ greatly from the equation we got when we 
included all 13 years. But we must remember that this equation 
is based on millions of barrels and on time units of half years. 
Thus our value of b tells us that petroleum production tended to 
increase at the rate of 30.0 million barrels per half year. Now 
that we have the equation, and have made the arithmetical sav¬ 
ings which are to be had by taking the origin at the center, we can 
easily change that origin again if we wish, and we do usually wish 
to do so. 

Our equation tells us that the trend value at time of origin 
(that is, in 1923½) was 666.4, and that the trend rose 30.0 million 
barrels each half year. Obviously in 1924, a half year later, the 
height of the trend will be 666.4 + 30.0, or 696.4 million barrels. 
And if it rises 30.0 million barrels each half year, it will rise 60.0 
million barrels each year. So we can shift the origin of the 
straight-line trend easily, writing it thus: 

Y = 696.4 + 60.0X 

Origin 1924 

We are now back in terms of full years, and with a 1924 origin. 
When we computed with all 13 years, we found a tendency for 
production to rise at the rate of 58.1 million barrels a year. 
When we based our conclusions on but 12 years, we found a rise 
of 60.0 million barrels a year. Of course, the trend based on 12 
years is not precisely the same as that based on 13 years, but they 
are very nearly the same. 
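
The half-year device and the later change of origin and unit can be followed step by step in code. The Python sketch below is illustrative only, with variable names chosen here; it computes X in half years from the midpoint of the period, fits the line, and then restates it in full years with a 1924 origin.

```python
years = list(range(1918, 1930))            # twelve years, an even number
production = [356, 378, 443, 472, 558, 732, 714, 764, 771, 901, 901, 1007]

# X in half years from the center of the period (midway between 1923 and
# 1924): 2*year - (first + last) gives -11, -9, ..., 9, 11.
xs = [2 * year - (years[0] + years[-1]) for year in years]

n = len(years)
sum_y = sum(production)
sum_xy = sum(x * y for x, y in zip(xs, production))
sum_xx = sum(x * x for x in xs)

a_half = sum_y / n            # trend value at the midpoint origin
b_half = sum_xy / sum_xx      # increase per half year

# Restate in full years with origin 1924, which lies one half year
# after the midpoint origin.
a_1924 = a_half + b_half * 1
b_year = 2 * b_half

print(xs)
print(round(a_half, 1), round(b_half, 1))
print(round(a_1924, 1), round(b_year, 1))
```

The conversion touches only the way the line is described; the line itself, and every trend value read from it, is unchanged.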

11.13. The Direction of Dependence. —If we are given an 
ordinary first-degree algebraic equation in two unknowns, we are 
accustomed to substituting a value for either unknown and solv¬ 
ing for the other. For example, if we are told that 

m = 5j + 3 

we feel that we can get m from j or j from m. If we are told that 
j = 2, we substitute it in the equation and find that m = 13. If, 
on the other hand, we are told that m = 23, we substitute this 
value in the equation and find that j = 4. First we solved for 
m; then we solved the same equation for j. The mathematician 
would say that the equation is “explicit in m” and “implicit in 
j,” meaning that it states directly the value of m, but that we can 
find from it indirectly the value of j. It is important for the stu¬ 
dent to understand, however, that he may not take such liberties 
with regression equations. If he has computed a regression equa¬ 
tion or a trend equation, whether by the method of selected points 
or by the method of least squares, he has computed it on the 
supposition that a particular one of the variables was dependent, 
and that this was the variable which he was to estimate. If he 
wishes to estimate the other variable, he may not use the equation 
at all, but must start again and compute an entirely different 
equation, calling his new dependent variable “Y” and his new 
independent variable “X.” 

We might try this with the data of Table 11.7 on page 313. 
When we used these data before, we called age the independent 
variable and symbolized it with X, and we called blood pressure 
the dependent variable and symbolized it with Y. Now let us 
reverse the use of symbols in that table, calling the ages Y and the 
pressures X. Then we have 

ΣX = 2,672    ΣY = 719    N = 21    Σ(XY) = 93,698 

We have to compute a new column of X², which we find by squaring 
each blood pressure (since blood pressures are now X's) and 
adding the squares. This will give us a new column in the table. 
The sum of these squares is found to be 342,600, so we can write 

ΣX² = 342,600 

Our normal equations, as usual, are 

Na + bΣX = ΣY 
aΣX + bΣX² = Σ(XY) 

Substituting the values for our problem gives 

21a + 2,672b = 719 
2,672a + 342,600b = 93,698 

Solving these for a and b we get 

a = -73.2    b = 0.844 

Our regression equation is 

Y = -73.2 + 0.844X 

This tells us that every added millimeter of blood pressure indicates, 
on the average, 0.844 added years of age. 
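
The solution of the two normal equations can be sketched with a short Python function. This is an illustration only, with a function name chosen here; it uses the sums exactly as printed above, so the last digit may differ slightly from the values in the text, presumably through rounding in the printed sums.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve  n*a + b*sum_x = sum_y  and  a*sum_x + b*sum_xx = sum_xy."""
    det = n * sum_xx - sum_x * sum_x
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# X is now blood pressure and Y is age, as in the text.
a, b = solve_normal_equations(n=21, sum_x=2672, sum_y=719,
                              sum_xy=93698, sum_xx=342600)

print(f"a = {a:.1f}, b = {b:.3f}")

# Both normal equations should be satisfied by the solution.
print(21 * a + 2672 * b, 2672 * a + 342600 * b)
```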

To distinguish the two cases, we call this regression equation 
which we have just computed “the regression of age on blood 
pressure,” while the regression equation which we computed 
earlier was “the regression of blood pressure on age.” When we 
say “the regression of A on B” we mean that A is dependent and 
B is independent. If we investigate any two variables, we may 
be interested in either or both regressions. The two regression 



Fig. 11.9. Showing two regression lines for the same data. Use solid line 
for estimating blood pressure from age. Use broken line for estimating 
age from blood pressure. 

lines are plotted on Fig. 11.9, where the student can see that they 
differ sharply. If one wanted to estimate age from blood pres¬ 
sure, and used the solid line in Fig. 11.9, he would get results 
which were very far from probable except in the neighborhood of 
the point where the two regression lines cross. The two lines 
always cross at the point which represents the averages of the two 
variables. In Fig. 11.9 the crossing point of the two regression 
lines represents the average blood pressure and the average age. 

If we have let X represent age and Y represent blood pressure, 
as we did when we first started to investigate this problem, we 
could distinguish the two regression coefficients with subscripts, 
thus: 

b_yx = regression of Y on X = 0.3 

b_xy = regression of X on Y = 0.844 

We shall see in our later studies that both regression coefficients 
will always have the same sign in linear regression—both will be 
positive or both will be negative. This means that both lines 
will slope in the same general direction, both tilted “up” or both 
tilted “down” as one moves across the diagram from left to right. 
But the sizes of the coefficients will ordinarily differ even if the 
signs are the same, and the two lines, if drawn on the same scatter 
diagram, will ordinarily not coincide. Not only is this so, but we 
can generalize even more. In the limiting case where all the 
points on the scatter diagram actually fall along a straight line, 
the regression lines will coincide, but when they do not fall along 
a straight line (and in practice they cannot be expected to do so), 
the regression of X on Y will always be steeper than the regression 
of Y on X, as illustrated in Fig. 11.9. 

To repeat our point in slightly different words, if one wanted 
to estimate the most probable blood pressure for a man 78 years 
old, and used Fig. 11.9 to make his estimate, he should use the 
solid line, and quick inspection indicates that the estimated blood 
pressure would be approximately 140. But if we turn the prob¬ 
lem around, and ask for an estimate of the age of a man with a 
blood pressure of 140, we do not estimate an age of 78. We use 
the broken regression line, and estimate an age of about 46. It 
may seem to the student at first that this result is not rational. 
If we estimate that a man 78 years old has a blood pressure of 140, 
should we not estimate that a man with a blood pressure of 140 is 
78 years old? But a little reflection will show to the student that 
such estimates cannot be expected to be reversible. For example, 
the number of children in the family can be expected to be related 
positively to the age of the mother. One would hardly expect, 
for example, that a woman with eight children would be only 20 
years old. Suppose you were asked to estimate the most prob¬ 
able age of a woman with 20 children. You would almost cer¬ 
tainly choose an age of 40 or more—let us say an age of 50. But 
if you were now asked to estimate the most probable number of 
children for a woman with the age of 50 you would not be likely 
to select 20. You would select intuitively a value closer to the 
average. It is this tendency of estimates to “regress” toward 
the average which leads us to speak of the estimating equations 
and lines as “regression” equations and “regression” lines. 

To make this point even more clear, let us look again at Fig. 
11.9, and remember that whenever we estimate a value of blood 
pressure for a given age we use the solid line, while when we esti¬ 
mate an age from a blood pressure we use the broken line. Now 
let us start with a man at the age of 80, and estimate his blood 
pressure. The diagram indicates that we would estimate a blood 
pressure of about 141. Let us then take a blood pressure of 141 
and estimate the age, using the broken line. The estimated age 
is about 47. And if we start with an age of 47 we estimate a 
blood pressure of 132; and from a blood pressure of 132 we esti¬ 
mate an age of 38; and from an age of 38 we estimate a blood 
pressure of 128; etc. Our estimates of blood pressure have 
dropped successively from 141 to 132 to 128, regressing toward 
the mean. And our estimated ages have dropped from 80 to 47 
to 38, also regressing toward the mean. Scientists long ago dis¬ 
covered the general fact, which we shall evaluate in numerical 
terms in a later chapter, that when we estimate any dependent 
variable Y from a value of any related independent variable X, 
we always select a value of Y which is closer to the average than 
was the value of X with which we started. 1 We “regress” 
toward the mean. It is this fact of regression which makes it 
impossible to use the same regression equation for estimating 
values of both variables. It is the reason that the student must 
remember to use any given regression equation for estimating 
values of that variable, and of that variable only, which was used 
as “dependent” in computing the equation. 
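
The back-and-forth estimating just described can be imitated in a rough Python sketch. Two assumptions are made here and should be kept in mind: the regression of blood pressure on age is taken as a line of slope 0.3 (the rounded coefficient quoted above) passing through the two averages, 719/21 years and 2,672/21 mm.; and the regression of age on blood pressure is taken with its rounded printed coefficients. The figures will therefore only approximate those read from Fig. 11.9.

```python
N = 21
MEAN_AGE = 719 / N     # about 34.2 years
MEAN_BP = 2672 / N     # about 127.2 mm.

B_BP_ON_AGE = 0.3                          # rounded slope quoted in the text
A_AGE_ON_BP, B_AGE_ON_BP = -73.2, 0.844    # regression of age on pressure


def bp_from_age(age):
    # Assumed form: slope 0.3 through the point of averages.
    return MEAN_BP + B_BP_ON_AGE * (age - MEAN_AGE)


def age_from_bp(bp):
    return A_AGE_ON_BP + B_AGE_ON_BP * bp


age = 80.0
ages, pressures = [age], []
for _ in range(3):
    bp = bp_from_age(age)     # estimate pressure from age
    age = age_from_bp(bp)     # then estimate age from that pressure
    pressures.append(bp)
    ages.append(age)

print("pressures:", [round(p, 1) for p in pressures])
print("ages:     ", [round(a, 1) for a in ages])
```

Each pass moves both estimates closer to the averages, which is the regression toward the mean described in the text.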

When we first started our problem of ages and blood pressures, 
we were interested in estimating blood pressures from known 
ages—blood pressure was dependent and age independent. For 
this reason, since we were to estimate blood pressures, we used 
normal equations which minimized the errors of estimate in blood 
pressure. This we can see directly from Fig. 11.6, where we note 
that the errors are taken vertically, or we can see it again in Table 
11.9, where we see that we took the difference between the actual 
value of Y and the estimated value of Y. The errors were errors 
in Y, and measured in whatever units Y was measured in. If, 

1 Of course, it is necessary to use comparable units for comparing our 
two variables. How do we know whether a blood pressure of 141 mm. is 
greater or smaller than an age of 80 years? How can we compare 141 mm. 
and 80 years? The answer, as we learned in Sec. 7.6, is to put both variables 
in standard units. 

now, we want to estimate values of X, we should minimize the 
errors in X, which would be the horizontal errors in our original 
diagram. The straight line which gives the smallest sum of the 
squared horizontal errors will ordinarily differ from that which 
gives the smallest sum of the squared vertical errors. In Fig. 
11.10 we see a single point of a scatter diagram and a regression 
line. If we consider that the point does not lie on the line because 
of some error in Y , then the error must be measured vertically 
along the Y axis, as shown by the vertical broken line. If, on 
the other hand, we think of our errors of estimate as being errors 
in X, we must measure our errors horizontally along the X axis, as 



Fig. 11.10. The vertical errors should be minimized if we wish to estimate 
values of Y; the horizontal errors, if we wish to estimate values of X. 

shown by the horizontal broken line. When we compute the 
regression of Y on X (which we ordinarily do, since we usually let 
the symbol Y represent the dependent variable), we are minimizing 
errors like those represented by the vertical broken line in Fig. 
11.10; but a regression line which minimizes such vertical errors 
will not usually minimize the horizontal errors in X, and consequently 
if we want to make estimates of X with as little error as 
possible we should use an entirely different line and an entirely 
different regression equation—a line and an equation computed in 
such a way as to minimize the horizontal errors in X. The two 
lines will coincide only in cases where there is what we shall call 
“perfect linear correlation” in Chap. XV. 
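
The difference between minimizing vertical and horizontal errors can be seen with a tiny numerical sketch in Python. The six pairs below are hypothetical, made up purely for illustration; the function names are likewise chosen here.

```python
# Hypothetical (X, Y) pairs, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 7, 5]


def fit(u, v):
    """Least-squares line v = a + b*u, with v dependent and u independent."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    b = (sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
         / sum((ui - mean_u) ** 2 for ui in u))
    return mean_v - b * mean_u, b


def horizontal_sse(a, b):
    """Sum of squared errors in X for a line written as X = a + b*Y."""
    return sum((x - (a + b * y)) ** 2 for x, y in zip(xs, ys))


a_yx, b_yx = fit(xs, ys)   # regression of Y on X: vertical errors minimized
a_xy, b_xy = fit(ys, xs)   # regression of X on Y: horizontal errors minimized

# The Y-on-X line solved for X, i.e. misused to estimate X from Y.
inv_a, inv_b = -a_yx / b_yx, 1 / b_yx

print("horizontal SSE, regression of X on Y:", round(horizontal_sse(a_xy, b_xy), 3))
print("horizontal SSE, Y-on-X line inverted:", round(horizontal_sse(inv_a, inv_b), 3))
print("slopes on the same diagram:", round(b_yx, 3), round(1 / b_xy, 3))
```

On the usual diagram with Y vertical, the regression of X on Y has slope 1/b_xy, and in this sketch it comes out steeper than the regression of Y on X, in keeping with the general statement above.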

11.14. Elimination of Trend. —We noted earlier that one may 
wish to describe a trend either because he is interested in the trend 
itself or because he wishes to get rid of the trend and study those 
movements which are left. We have seen in this chapter how one 
describes the trend with an equation, and it remains merely to see 
how one might eliminate this trend and why one might wish to do 
so. 

Reverting to the problem of petroleum production described in 
Sec. 11.12, we found that the trend could be described by the 
equation 

Y = 641 + 58.1X 
Origin 1923 

Using this equation we computed the trend value for 1920 as 
466.7 million barrels, as compared with an actual production of 
443 million barrels. If we use the equation similarly to compute 
the trend value for each of the other years, we get the values 
shown in the next to the last column of Table 11.12, which compares 

Table 11.12.—Deviations from the Least-squares Trend 

              Actual                          Residual
  Year      Production     Trend Value     (Y minus trend
   (X)         (Y)                             value)

  1917         335           292.4            +42.6
  1918         356           350.5             +5.5
  1919         378           408.6            -30.6
  1920         443           466.7            -23.7
  1921         472           524.8            -52.8
  1922         558           582.9            -24.9
  1923         732           641.0            +91.0
  1924         714           699.1            +14.9
  1925         764           757.2             +6.8
  1926         771           815.3            -44.3
  1927         901           873.4            +27.6
  1928         901           931.5            -30.5
  1929       1,007           989.6            +17.4


these trend values or “normal” values with the actual 
values which are repeated from Table 11.10. Figure 11.11 shows 
both the original values and the trend values. 

The values of the last column of Table 11.12 show, not the 
actual petroleum production in any year, but the amount by 
which the actual production differed from the trend value. They 
are found in any year by subtracting the trend value for that year 
from the actual value. If we think of the trend as representing 
the natural or normal or expected rate of production, then these 
figures show by how much the actual production differed from 
normal. We can best illustrate, perhaps, by comparing the pro¬ 
duction of two years. In 1928 the production was 901 million 
barrels, and in 1923 it was 732 million barrels. Evidently, 1923 
was a year of comparatively low production and 1928 a year of 
high production. But this is not so if we consider the trend to 
represent normal production, for in 1923 the output was 91 million 



Fig. 11.11. United States petroleum production, 1917-1929, with least- 
squares straight-line trend, and with deviations from trend indicated. 

barrels above normal and in 1928 the output was 30.5 million 
barrels below normal. To be sure, the 1928 output was larger 
than that of 1923, but there had been a general tendency for pro¬ 
duction to increase during the interim, and it had not increased so 
much between 1923 and 1928 as one should expect. 
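
The elimination of trend amounts to one subtraction per year, as the following Python sketch shows. It is illustrative only: the names are chosen here, and the trend equation is used with its rounded coefficients, Y = 641 + 58.1X with origin 1923.

```python
A, B, ORIGIN = 641.0, 58.1, 1923

production = {
    1917: 335, 1918: 356, 1919: 378, 1920: 443, 1921: 472, 1922: 558,
    1923: 732, 1924: 714, 1925: 764, 1926: 771, 1927: 901, 1928: 901,
    1929: 1007,
}

# Deviation from trend = actual value minus trend value for the same year.
deviations = {
    year: round(actual - (A + B * (year - ORIGIN)), 1)
    for year, actual in production.items()
}

for year, dev in deviations.items():
    print(year, f"{dev:+.1f}")

# 1928 out-produced 1923 in barrels, yet stood lower relative to trend.
print(production[1928] > production[1923], deviations[1928] < deviations[1923])
```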

If we took a year of exceptionally large potato production for 
the United States in the decade 1840-1850 and a year of abnormally 
low potato production in the decade 1920-1930, we should 
almost certainly find that the “low” production of the twentieth 
century was larger (in terms of bushels) than the “high” production 
of the nineteenth century. Whether a production figure 
is high or low is a relative matter. When we say that petroleum 
production was high at any time we mean that it was higher than 
was to have been expected at that time; that is, it was above the 
trend. The price of potatoes would, in all probability, have been 
low in the year of “high” production in the 1840’s; and when we 
had the year of “low” production in the 1920’s, the price would 
presumably have been high. The fact that the 1920 production 
was greater than the 1840 production would not make prices low; 
the fact that the 1920 production was below the trend might 
make the prices high. For this reason we are often primarily 



Fig. 11.12. Deviation of United States petroleum production from 
straight-line trend, 1917-1929. Figures from table on page 334. 


interested in getting rid of the trend entirely in order that we 
may study the deviations from the trend. 

Figure 11.12 shows the deviations of petroleum production 
from the trend for the years 1917-1929. It is a graph of the 
figures in the table on page 334. One might well be interested 
in trying to explain how it happened that the 1923 output was 
so very large as compared with the trend and why the 1928 out¬ 
put (which was actually larger) was so very small as compared 
with the trend. It will be noted that the entire tendency for 
production to increase has disappeared; the tendency for the 
values on the chart to rise toward the right is gone. The trend 
has truly been “eliminated,” and merely the deviations are left. 

11.15. Suggestions for Further Reading.—The following chapter continues 
the discussion of trends and regression lines, extending it to cases 
where the relationship is curvilinear. Suggestions for further reading are 
given at the close of that chapter to cover both linear and curvilinear trends 
(see Sec. 12.15). 

EXERCISES 

1. In each of the following pairs, consider the first as the independent 
and the second as the dependent variable. Would you expect, in each case, 
to find a positive relationship, a negative relationship, or no relationship? 

a. Marks received by students in mathematics courses. Marks received 
by same students in physics courses. 

b. Marks received by students in mathematics courses. Marks received 
by same students in courses in English literature. 

c. Lengths of men’s forearms. Heights of same men. 

d. Lengths of men’s forearms. Lengths of same men’s eyelashes. 

e. Number of children in families. Annual amount of money saved by 
same families. 

f. Number of hours students have practiced typewriting. Number of 
errors made by same students on a typing test. 

g. Speed in miles per hour at which automobiles are driven. Number of 
miles same automobiles run before first tire wears out. 

h. Age at which student enters high school. Marks received by student 
in freshman year. 

i. Numbers of books in libraries of colleges. Numbers of students 
enrolled at same colleges. 

j. The age of the moon (number of days since new moon). The amount 
of rainfall on the day. 

2. Are there particular pairs of variables in the preceding problem where 
you would expect that the relationship would be curvilinear? Why? 

3. Find an actual case of a trend which is at least roughly linear. Fit a 
straight trend to it by freehand methods. Find the equation of the trend 
by the method of selected points. 

4. A study of car-lot shipments of onions into the state of Connecticut 
from 1917 to 1924 shows a straight-line trend which can be described as 
follows: 

Y = 304.4 + 13.1X
Origin 1920

Interpret each of the figures in this equation. What would be the trend
value of car-lot shipments in 1926?¹

5. In the preceding problem, what was the trend value in 1895? Com¬ 
ment. 

6. The arithmetic mean of a series of values is fitted to them by the
method of least squares, although this fact is not known to most people
who compute it. Since it is so fitted, it must be true that the sum of the
squares of the deviations of the individual values from the arithmetic mean
is smaller than the sum of the squares of the deviations of the items from
any other value. Test this out. For example, the mean of the numbers
5, 7, 12, 2, and 4 is 6. Find the deviations of the five values from 6, and
the sum of the squares of these deviations. Compare this sum with the
sum of the squares of the deviations from any number other than 6. Try
5 and 7, for example, or 6.1.

¹ Data from F. V. Waugh, Connecticut Market Demand for Vegetables,
Storrs Agricultural Experiment Station Bulletin 138, p. 34.

7. Figure 11.13 shows a straight-line trend. Compute its formula by 
the method of selected points. 

8. Change the origin of the equation in Exercise 4 to the year 1924, and
restate the equation with the new origin. 



Fig. 11.13. This chart shows a straight-line trend. Compute its formula 
by the method of selected points. 

9. The statement is often made that the line fitted by the method of least
squares gives the “best fit.” Actually this statement needs several qualifications.
What are they?

10. Suppose in a given study our variables are 

X = number of decades before or after the origin
Y = cotton production in bales 

Our trend equation turns out to be 

Y = 759 + 23X
Origin 1870 

Imagine that you want to reconstitute the equation to put the origin in
1900. You want also to represent by Y the number of pounds of cotton,
and you want X in years rather than in decades. There are 500 lb. of cotton
to the bale. Write the new trend equation.

11. Look up some data which you think can be expected to show a fairly 
close relationship. Plot the scatter diagram. Describe the relationship 
shown on the diagram. 

12. Table 11.13 gives the ages and weights of 20 preschool boys. (a)
Plot the data on a scatter diagram, using age as the independent variable.
(b) Draw a freehand regression line and compute its equation by selected
points. (c) Fit a least-squares straight line to the data. (d) Interpret
each number in your least-squares equation. What weight would you
predict for a boy 25 months old?


Table 11.13.—Ages and Weights of Boys

Child         Age         Weight
              (months)    (pounds)

Charles         36          33
William         30          28
Orville         21          25
Wilbur          42          33
Andrew          45          35
George          48          35
Abraham         30          26
Thomas          24          25
Theodore        18          23
John            33          28
Robert          18          25
Dan             30          30
Peter           60          40
James           54          38
Donald          57          40
Henry           36          30
Leopold         15          22
Philip          27          26
Richard         51          37
Rupert          39          32

13. In the preceding problem, how do you account for the fact that the
points on the scatter diagram do not all lie exactly on the straight line? 

14. From your equations of Exercise 12, estimate the weight of a boy 
30 months old. Actually three of the boys in the table (William, Abraham, 
and Dan) are 30 months old. Does your estimate correspond with the 
weight of any one of them? Since no single estimate can possibly be correct 
for all of them, what value do you try to select as an estimate? 
CURVE FITTING

In the preceding chapter we have learned how to find straight 
lines and their equations for use in describing relationships. But 
there is no reason to suppose that all relationships are linear. 
Many a scatter diagram shows a definite curvilinear pattern. 
In these cases also we find lines and their equations to describe 
our data, but now the lines are curved. The statistician calls 
the process of finding lines which describe the basic pattern in 
scatter diagrams and in historical data the process of curve fitting,
using this expression to include both straight lines and curved 
lines. To the layman a straight line is not a curve. To the 
statistician it is the commonest sort of curve—a limiting form 
which many other curves approach. We have found it conven¬ 
ient in our chapter headings, however, to follow the lay rather than 
the technical usage, and we turn now to a consideration of the 
methods which are employed when our scatter diagram or our 
historical series shows a curvilinear pattern. 

12.1. Discovering Curvilinearity.—We discover that data 
show curvilinear relationship just as we discovered the existence 
of linear relationship—by the computation of group averages and 
by the use of scatter diagrams. For example, Table 11.4 on page 
297 shows a tendency for the divorce rate to increase during the 
first few years of marriage, reach a peak after 3 or 4 years, and 
then fall lower and lower. We cannot say that the relationship is 
either positive or negative, since at first it is positive and then 
negative. If we plot the averages of Table 11.4 we get Fig. 12.1, 
which makes the curvilinear nature of the relationship immedi¬ 
ately evident. A freehand smooth curve has been drawn through 
the dots of the diagram to guide the eye. 

Instead of using group averages (or preferably in addition
thereto) we may use scatter diagrams. The points in Fig. 12.2
are not closely bunched, and their pattern may not be immediately
apparent. It is clear at once, to be sure, that the relationship
is negative, with large productions bringing low prices. But
a little study of the diagram will indicate that the price tends to
drop very rapidly toward the left-hand side of the diagram, and
more slowly as we move toward the right. The relationship is
curvilinear. Figure 12.3 shows the same diagram with a rough
and approximate freehand curve drawn through it, just as we
have drawn freehand straight lines heretofore. In such cases we
attempt to draw the line in such a way that it would picture the
relationship as accurately as possible if it were used alone without
the dots. No simple smooth curve can be drawn which will pass
through all the dots on the diagram. It is our purpose, as before,
to show the basic underlying pattern rather than the idiosyncrasies
and peculiarities of the individual cases.

Fig. 12.1. Curvilinear relationship from data of Table 11.4, shown with
freehand line.

Fig. 12.2. Relationship between potato production in 27 late-crop states
and the price of potatoes at Minneapolis and St. Paul, by years, 1906-1918.
(Axis label: production, hundreds of millions of bushels.)

12.2. Common Curve Types.—The student of algebra will
recall the fact that various algebraic equations can be pictured by
curves with various characteristics. We have already made use
of the fact, for example, that any straight line can be described
by an equation of the general form

Y = a + bX

where a and b are constant values. For any given straight line,
we vary the values of X and find corresponding variation in the
values of Y. But when we compare different straight lines, we
find that a and b, which were constant for the particular curve, are
now variable. They differ for different straight lines, and the
actual values which they take determine which straight line we
have. These constants in the equation we call the parameters of
the curve, and in the case of the straight line we found that the
parameters could be determined either by the method of selected
points or by the method of least squares.

Fig. 12.3. Late-crop potato production and Twin Cities price, 1900-1918,
with freehand regression line.

Many other mathematical equations in X and Y can be 
described by lines. In fact, there is no limit to the number of 
different types of lines which can be so described. Fortunately 
for us, however, it seems to be possible to describe a very large 
proportion of the actual relationships which exist by means of 
some one of a very small number of rather simple equations. In 
each case we have a type equation, which determines the general
characteristics of the curve, just as the equation Y = a + bX is 
the type equation for straight lines. This type equation does not 
represent any single line alone, but a whole family of lines—the 
entire family of straight lines. Similarly, each of our other type 
equations represents a whole family of curves, and we have the 
problem of finding the values of the particular parameters for the 
particular curve of the family which fits our data. In the actual 
process of fitting the curve, if our original data have evenly spaced 
values of X, we always take the origin at the center of the data to 
reduce our arithmetical work, as we did when we fitted the 
straight line to the figures of petroleum production in Sec. 11.12.
With the curves which we are now to study, however, there is no 
simple and easy way to shift the origin of the curve after we have 
computed it; so we leave the equation with the origin at the center 
if we computed it that way. With historical data the X values 
are usually evenly spaced, but with other data this is not usually 
true, and in such cases we proceed by the longer method. This 
will be illustrated as we describe the processes in more detail. 

While, as we have said, there is no limit to the number of differ¬ 
ent families of curves which could be fitted, we find in practice 
that we can usually describe our data by one or another of six 
families, which are called: 

1. The straight line. 

2. The second-degree parabola.

3. The third-degree parabola.

4. The semilogarithmic curve. 

5. The logarithmic curve. 

6. The reciprocal curve. 

It is true that occasionally in specialized and advanced work the 
statistician will need to use other equations, but we shall confine 
our attention to these six simple cases which will suffice for all
ordinary work. The first of them we have already discussed in 
the preceding chapter, and we shall now proceed to take up the 
other five, one at a time. 

12.3. The Second-degree Parabola.—The type equation of the
second-degree parabola is

Y = a + bX + cX^2

We note that the equation is identical with that of the straight
line except for an added term, cX^2. It is this second power of X
which gives the curvature to the line, and accounts for its characteristic
shape. We note that there are three parameters a, b, and
c. Since it takes three equations to evaluate three unknowns, we
can see that it would be necessary to select three points for the
method of selected points, or to use three normal equations when
using the method of least squares.¹

Second-degree parabolas show a continuous curve which, throughout
its length, is everywhere either concave upward (like the cross section
of the side of a teacup) or concave downward (like the cross
section of the side of a derby hat). Typical second-degree parabolas are
shown in Fig. 12.4. Whether the curve is concave upward, like the
solid curve in the figure, or concave downward, like the broken
curve, depends on the value of c in the equation.² If the value of
c is positive, the curve is concave upward; if the value of c is negative,
the curve is concave downward. The curve may show a
maximum or a minimum point, as do the curves in the figure, or it
may be that all our data are on the rising or on the falling portion
of the curve. In either case, the entire curve should show one
continuous bend, without any reversals or points of inflection.

Fig. 12.4. Typical second-degree parabolic curves.

¹ In general, for any family of curves, we must select as many points or
use as many normal equations as the number of parameters in the type
equation.

² Just as, in the case of straight lines, the sign of b in the equation determines
whether the line has positive or negative slope.
We can illustrate with data showing petroleum production in
the United States from 1906-1929, which are given in Table 12.1.

Table 12.1.—Production of Crude Petroleum in the United States,
1906-1929

Year    Output (millions      Year    Output (millions
        of barrels)                   of barrels)

1906          126             1918          356
1907          166             1919          378
1908          179             1920          443
1909          183             1921          472
1910          210             1922          558
1911          220             1923          732
1912          223             1924          714
1913          248             1925          764
1914          266             1926          771
1915          281             1927          901
1916          301             1928          901
1917          335             1929        1,007

These data are shown graphically in Fig. 12.5, where it is apparent
that the data are not linear, but show in general a continuous
curve with upward concavity. Our first move in such cases is to
draw carefully a freehand curve which shows the general pattern
of the data. This has been done in Fig. 12.6. If, now, we are to
find the equation of the curve by the method of selected points,
we must select three points on this freehand curve (three points
since there are three parameters in the type equation of the
second-degree parabola) and substitute their values in the type
equation. Note that we select as our points not three of the
original pairs of values in Table 12.1, but the values of three
points on our freehand line. The three points selected should be
approximately evenly spaced, one near each end and one near the
middle. With the method of selected points, it is customary to
take as the origin, when we are dealing with time series as we are
here, the middle one of the selected points. Let us select from
our freehand line the values for the years 1908, 1920, and 1928,
taking our origin, in accordance with the rule we have just given,
in 1920. This makes the year 1919 the year -1, the year 1924
the year 4, etc. In the years which we have selected the curve
seems to have the values 170, 425, and 930. These values are
read from the diagram. The three pairs of values are, then,

X = -12 and Y = 170
X =   0 and Y = 425
X =   8 and Y = 930

Fig. 12.5. United States petroleum production, 1906-1929. (Data
from Table 10.3.)


To get our observation equations we substitute these values of X
and Y in the type equation of the second-degree parabola,

Y = a + bX + cX^2

This gives the three observation equations:

170 = a - 12b + (-12)^2 c
425 = a + 0 + 0
930 = a + 8b + 8^2 c

It is at once evident that the value of a is 425. (It also becomes
evident immediately that it saves computation to take as the
origin the year of our central observation!) Solving, we find
that the values of b and c are

b = 46.33
c = 2.09

Substituting these values of a, b, and c in the type equation
(Y = a + bX + cX^2), we get the equation of this particular
curve, which is

Y = 425 + 46.33X + 2.09X^2
Origin at 1920

When such an equation is given, it is important that the origin be 
stated with the equation. Otherwise the results are meaningless. 
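
The solution of the three observation equations can also be checked mechanically. The following Python sketch is an illustration added here, not part of the original computation; the function name parabola_through_points is hypothetical, and exact fractions are used only so that no rounding enters the elimination. Carried through without intermediate rounding, the system gives b = 46.375 and c of about 2.094, so the 46.33 quoted in the text presumably reflects rounding c to 2.09 before substituting back.

```python
from fractions import Fraction


def parabola_through_points(points):
    """Solve Y = a + bX + cX^2 through three (X, Y) points by Cramer's rule."""
    rows = [[Fraction(1), Fraction(x), Fraction(x) ** 2] for x, _ in points]
    rhs = [Fraction(y) for _, y in points]

    def det3(m):
        # Determinant of a 3 x 3 matrix, expanded along the first row.
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(rows)
    if d == 0:
        raise ValueError("the three X values must be distinct")

    coefficients = []
    for col in range(3):
        # Replace one column by the right-hand side, as Cramer's rule requires.
        replaced = [row[:] for row in rows]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coefficients.append(float(det3(replaced) / d))
    return tuple(coefficients)


# Values read from the freehand curve for 1908, 1920, and 1928, origin at 1920.
a, b, c = parabola_through_points([(-12, 170), (0, 425), (8, 930)])
print(a, b, c)
```

Any three points with distinct X values may be supplied in the same way; two workers who read different values from their freehand curves will, of course, get different parameters.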

From this equation let us estimate the trend value for
1910. When 1920 is the origin, the year 1910 becomes the year
-10; that is, X becomes -10 when we wish to estimate the
1910 production. Substitute -10 for X in the equation
and it becomes

Y = 425 + 46.33(-10) + 2.09(-10)^2 = 170.7

Thus our estimate of petroleum production in 1910 is 170,700,000
barrels (since our production figures are in millions of barrels).

This equation now describes our curvilinear trend, and we can 
easily and accurately tell others our conclusions. 
It must be remembered, however, that two investigators fitting
trends by the method of selected points may well obtain some¬ 
what different results. The freehand trends which they draw 
originally to guide them may well differ somewhat, and the points 
they select from which to get values for their observation equa¬ 
tions may differ. Thus this combination of methods (freehand 
trend plus selected points) has the disadvantage that two equally 
competent workers may differ in their conclusions. The methods 
are, however, quick and easy to apply; and every statistician 
finds them useful at times. 

Despite these advantages, the statistician often prefers the 
objectivity of the method of least squares, and the fact that this 
method will give him (if the residuals are normally distributed) 
the particular second-degree parabola which is more likely to be 
right than any other second-degree parabola. To fit this curve 
by the method of least squares we need three normal equations, 
since we are looking for three parameters. The normal equations 
are¹

Na + bΣX + cΣX^2 = ΣY
aΣX + bΣX^2 + cΣX^3 = Σ(XY)
aΣX^2 + bΣX^3 + cΣX^4 = Σ(X^2Y)

¹ The normal equations can be easily derived, like those for the straight
line on page 312. The type equation is

Y = a + bX + cX^2

Each actual Y will differ from the estimate by some deviation d (which
may equal 0). Thus

Y + d = a + bX + cX^2
d = a + bX + cX^2 - Y
d^2 = (a + bX + cX^2 - Y)^2
f = Σd^2 = Σ(a + bX + cX^2 - Y)^2

The partial differentials of this function with respect to a, b, and c must be
set equal to 0 if we are to minimize the sum of the squared residuals. That
is,

∂f/∂a = 2Σ(a + bX + cX^2 - Y) = 0
∂f/∂b = 2Σ(a + bX + cX^2 - Y)X = 0
∂f/∂c = 2Σ(a + bX + cX^2 - Y)X^2 = 0

Canceling the 2's, expanding, and summing, we get

Na + bΣX + cΣX^2 - ΣY = 0
aΣX + bΣX^2 + cΣX^3 - ΣXY = 0
aΣX^2 + bΣX^3 + cΣX^4 - ΣX^2Y = 0

Transposing the final terms of each of these equations, we get the normal
equations as given in the text.

In order to find the values of the sums called for in the normal
equations, we set up our data as in Table 12.2, taking the origin at
the center, which is at the year 1917.5 (halfway between 1917 and
1918). This gives the values for the individual years shown in
the column headed (X). The arithmetical work is tedious but
not difficult, and the advantage of taking the origin at the center
is immediately apparent. With this origin it is unnecessary to
add two of the columns at all, and we can get the sums of two
others quickly by adding half the column and multiplying the sum
by 2, since the last half of the column merely duplicates the first
half.

Table 12.2.—Computation of Second-degree Parabolic Trend of
Petroleum Production, 1906-1929

Year   Output      X    X^2        XY        X^2Y       X^3         X^4
         (Y)  (origin,
               1917.5)

1906      126    -23    529    -2,898      66,654   -12,167     279,841
1907      166    -21    441    -3,486      73,206    -9,261     194,481
1908      179    -19    361    -3,401      64,619    -6,859     130,321
1909      183    -17    289    -3,111      52,887    -4,913      83,521
1910      210    -15    225    -3,150      47,250    -3,375      50,625
1911      220    -13    169    -2,860      37,180    -2,197      28,561
1912      223    -11    121    -2,453      26,983    -1,331      14,641
1913      248     -9     81    -2,232      20,088      -729       6,561
1914      266     -7     49    -1,862      13,034      -343       2,401
1915      281     -5     25    -1,405       7,025      -125         625
1916      301     -3      9      -903       2,709       -27          81
1917      335     -1      1      -335         335        -1           1
1918      356      1      1       356         356         1           1
1919      378      3      9     1,134       3,402        27          81
1920      443      5     25     2,215      11,075       125         625
1921      472      7     49     3,304      23,128       343       2,401
1922      558      9     81     5,022      45,198       729       6,561
1923      732     11    121     8,052      88,572     1,331      14,641
1924      714     13    169     9,282     120,666     2,197      28,561
1925      764     15    225    11,460     171,900     3,375      50,625
1926      771     17    289    13,107     222,819     4,913      83,521
1927      901     19    361    17,119     325,261     6,859     130,321
1928      901     21    441    18,921     397,341     9,261     194,481
1929    1,007     23    529    23,161     532,703    12,167     279,841

Totals 10,735      0  4,600    85,037   2,354,391         0   1,583,320

Let us now substitute the totals at the proper points in the 
normal equations. We get 

24a + 0b + 4,600c = 10,735
0a + 4,600b + 0c = 85,037
4,600a + 0b + 1,583,320c = 2,354,391

From the second of the normal equations we see at once that the
value of b is 18.5. Solving for the others, we find the three values
to be

a = 366.2
b = 18.5
c = 0.4231

If we substitute these values in the type equation
(Y = a + bX + cX^2)

we get

Y = 366.2 + 18.5X + 0.423X^2
Origin 1917.5 
Deviations in half years 
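
The arithmetic of Table 12.2 and the solution of the normal equations can be reproduced in a few lines. The Python sketch below is an added illustration under the same conventions as the text (origin at 1917.5, X in half years); the variable names and the helper trend are assumptions of this sketch. It relies on the fact that, with the origin at the center, ΣX and ΣX^3 vanish, so that b comes from the second normal equation alone and a and c from the other two.

```python
years = list(range(1906, 1930))
output = [126, 166, 179, 183, 210, 220, 223, 248, 266, 281, 301, 335,
          356, 378, 443, 472, 558, 732, 714, 764, 771, 901, 901, 1007]

# X in half-year units from the origin 1917.5: -23, -21, ..., 21, 23.
xs = [round(2 * (year - 1917.5)) for year in years]

n = len(xs)
sum_y = sum(output)
sum_x2 = sum(x ** 2 for x in xs)
sum_x4 = sum(x ** 4 for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, output))
sum_x2y = sum(x * x * y for x, y in zip(xs, output))

# Centered origin: sum of X and sum of X^3 are zero, so the normal equations are
#   n*a + sum_x2*c = sum_y
#   sum_x2*b       = sum_xy
#   sum_x2*a + sum_x4*c = sum_x2y
b = sum_xy / sum_x2
c = (n * sum_x2y - sum_x2 * sum_y) / (n * sum_x4 - sum_x2 ** 2)
a = (sum_y - c * sum_x2) / n


def trend(year):
    """Trend value (millions of barrels) for a calendar year."""
    x = 2 * (year - 1917.5)  # convert years from the origin to half years
    return a + b * x + c * x * x


print(round(a, 1), round(b, 1), round(c, 4))
```

Because the sketch keeps all decimal places, its trend values differ slightly from those obtained with the rounded parameters; for 1927, for instance, it gives roughly 870.2, where the rounded values 18.5 and 0.423 lead to 870.4.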

It is necessary in this case to state the origin and also the fact 
that the deviations are in half years. Let us now estimate the 
petroleum output for the year 1927 by this formula. The year 
1927 is 9.5 years after the origin, but since we are measuring in 
half-year units we must convert the 9.5 years to half years, getting 
19 half years for our value of X. This we substitute for X in 
the formula to get 

Y = 366.2 + 18.5(19) + 0.423(19)^2 = 870.4

Thus this formula gives an estimate of 870.4 million barrels for 
1927. If we estimate the trend value for each of the 24 years by 
substituting the various values of X in our formula, and if we 
locate the estimates on a graph of the data, we can easily draw 
our parabola through the points so estimated, as in Fig. 12.7. 
In this figure we show the original data with both the straight 
line and the second-degree parabola fitted by least squares. It 
is apparent from inspection that we cannot say that both lines
are lines of “best fit.” The straight line gives a very poor fit
indeed, being too low at the ends and too high in the middle. It
is not the best fitting line, but the best fitting straight line. It
has a greater likelihood of being correct, if the residuals are normally
distributed, than does any other straight line. But no
straight line has much likelihood of being correct. The student
must remember that the method of least squares does not automatically
give him the best line. Under the assumptions which
we explained earlier, it will give the most likely line of the family—
more likely to be right than any other straight line, or, if we fitted
a second-degree parabola, more likely to be right than any other
second-degree parabola. We still have to make subjective judgments
in deciding what sort of curve to fit, and when we once
decide to fit a given type of curve the methods will force our
answer to be in the form of that sort of curve willy nilly.

Fig. 12.7. United States petroleum production, 1906-1929, with parabolic
and straight-line trends. Both trends are fitted by least squares.

One of the peculiarities of all second-degree parabolas is that
they pass through a minimum point (if concave upward) or a
maximum point (if concave downward). This is true of all
second-degree parabolas, although the maximum point or minimum
point may occur outside the limits of the data which we
have studied. In some cases our original data will show a maximum
or a minimum, but in other cases, as in Fig. 12.7, this is not
true. The petroleum production rises throughout, with no minimum
within the period of our study. Yet if we extrapolate the
curve it will pass through a minimum point, since it is concave
upward. The maximum or the minimum is always found at the
point where X = -b/2c. In the case which we have just
studied, where b = 18.5 and c = 0.423, we find the minimum
point where X = -18.5/[(2)(0.423)] = -21.9 half years. With
our origin at 1917.5 this represents the year 1906.55; so we know
that if we extended our parabola backward it would drop lower
and lower until it reached a minimum point between 1906 and
1907, and if we projected it back farther still it would begin to
rise as we went backward until it led us to estimate greater and
ever greater amounts of petroleum production as we went back
into antiquity and prehistory. This is no fault of the curve.
It is, rather, a fault of the statistician who does not understand
the dangers of extrapolation against which we have already given
warning. Within the period which we have studied, the curve
gives estimates of petroleum production which are reasonable and
fairly close to actual values. Outside the period, this same curve
may give (and in this case does give) values which are ridiculous.
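
The location of the turning point, and the absurdity of extrapolating past it, can be verified numerically. The short Python sketch below is an added illustration; it uses the rounded parameters quoted in the text (366.2, 18.5, 0.423), and the function names are hypothetical.

```python
A_COEF, B_COEF, C_COEF = 366.2, 18.5, 0.423  # rounded parameters from the text
ORIGIN = 1917.5                              # origin of the fitted parabola


def turning_point_x(b, c):
    """X at which Y = a + bX + cX^2 reaches its maximum or minimum."""
    if c == 0:
        raise ValueError("a straight line has no turning point")
    return -b / (2 * c)


def trend(year):
    """Trend value (millions of barrels); X is measured in half years."""
    x = 2 * (year - ORIGIN)
    return A_COEF + B_COEF * x + C_COEF * x * x


x_min = turning_point_x(B_COEF, C_COEF)  # in half years from the origin
year_min = ORIGIN + x_min / 2            # converted back to a calendar year

print(round(x_min, 1), round(year_min, 2))
# Extrapolated backward past the minimum, the curve rises again:
print(trend(1890) > trend(1906))
```

The comparison in the last line shows the parabola estimating a larger output for 1890 than for 1906, which is exactly the kind of result the warning against extrapolation is meant to prevent.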

12.4. The Third-degree Parabola.—Just as the second-degree
parabola differs from the straight line by the addition of a new
term to its type equation, so the third-degree parabola differs by
the addition of still another term and another parameter. Its
type equation is

Y = a + bX + cX^2 + dX^3

There are four parameters, a, b, c, and d, and consequently we
need four points on the freehand curve if we are to use the method
of selected points, or four normal equations if we are to use the
method of least squares. Third-degree parabolas have a reverse
curve, or S-shape, like those of Fig. 12.8. Each such curve has
both a maximum and a minimum point. When the value of d in
the equation is positive, the maximum comes first, followed by
the minimum, like the solid curve of Fig. 12.8. When the value
of d is negative, the minimum precedes the maximum, as in the
broken curve of Fig. 12.8.

With the method of selected points we draw a freehand line,
select four points roughly evenly spaced (ordinarily choosing two
of the points near the maximum and the minimum if they are
evident) and determine by inspection the values of X and Y at these
four points. We then substitute these observed values of X and Y in
our type equation

Y = a + bX + cX^2 + dX^3

This gives us four equations with four unknowns, which we solve for
the parameters a, b, c, and d.

Fig. 12.8. Typical third-degree parabolic curves.

For the method of least squares we would require four normal 
equations, as follows: 


Na + bΣX + cΣX^2 + dΣX^3 = ΣY
aΣX + bΣX^2 + cΣX^3 + dΣX^4 = Σ(XY)
aΣX^2 + bΣX^3 + cΣX^4 + dΣX^5 = Σ(X^2Y)
aΣX^3 + bΣX^4 + cΣX^5 + dΣX^6 = Σ(X^3Y)


To find the values for substitution in these equations we would set
up a table showing our original values of X and Y and also
columns of X^2, X^3, X^4, X^5, X^6, XY, X^2Y, and X^3Y. The sums of
these columns would be substituted in our normal equations,
which would be solved to find the values of a, b, c, and d. The
process is straightforward, the arithmetic is simple, but the work
is endless, or seems so. When we get this far it is easy to see why
the statistician seldom uses parabolas of higher than the third
degree, although it is possible to extend the process to fourth-degree,
fifth-degree, and even higher parabolas. For example,
the equation of a fifth-degree parabola would be

Y = a + bX + cX^2 + dX^3 + eX^4 + fX^5

This would require six points and the solution of six simultaneous
equations. And if we used the method of least squares, it would
involve the computation of all integral powers of X up to and
including the tenth and cross products up to and including X^5Y.
With parabolas of higher degrees there is one more bend in the
curve for each extra parameter. The straight line (which we can
think of as a first-degree parabola if we wish) has two parameters
and no bends. The second-degree parabola has three parameters
and one bend. The third-degree parabola has four parameters
and two bends. The nth-degree parabola has n + 1 parameters
and n - 1 bends. The scientist, however, is ordinarily trying to
get rid of the bends and wiggles and convolutions in the line, and
to smooth out the irregularities. He assumes that the underlying
pattern is a simple one, and for that reason, as well as
because of the tedious arithmetical work, he seldom uses parabolas
above the third degree.
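
The pattern of the normal equations is the same for a parabola of any degree: the coefficient in row i and column j is the sum of X^(i+j), and the right-hand side of row i is the sum of X^i Y. The Python sketch below, added here as an illustration, builds those sums and solves the equations by ordinary elimination. The function name fit_polynomial and the small data set are assumptions of this sketch, and the data are made up solely to show that an exact cubic is recovered.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares parameters [a, b, c, ...] of Y = a + bX + cX^2 + ..."""
    m = degree + 1
    # Sums of powers of X up to 2*degree, and cross products X^k * Y up to degree.
    power_sums = [sum(x ** k for x in xs) for k in range(2 * degree + 1)]
    cross_sums = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(m)]

    # Augmented matrix of the normal equations.
    aug = [[float(power_sums[i + j]) for j in range(m)] + [float(cross_sums[i])]
           for i in range(m)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular; "
                             "more distinct X values are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


# Hypothetical data lying exactly on Y = 2 + X - 0.5X^2 + 0.25X^3.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [2 + x - 0.5 * x ** 2 + 0.25 * x ** 3 for x in xs]
coefficients = fit_polynomial(xs, ys, 3)
print([round(value, 6) for value in coefficients])
```

A machine removes the tedium of the arithmetic but not the objection raised in the text: each added degree adds a bend, and the fitted curve will follow irregularities that the investigator is usually trying to smooth away.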

12.5. The Semilogarithmic or Exponential Curve.—The type
equation of the semilogarithmic curve may be written in several
forms, but let us start with a form which is analogous to that
which we have already used for our other curves. This is

(1) log Y = a + bX

Evidently this is exactly the same as the formula for a straight
line except for the fact that we have used logarithms of Y instead
of the actual values of Y themselves. This function is a curve
when we plot values of X and Y, but it is transformed into a
straight line if we plot values of X and log Y. We shall make use
of this linear transformation later.

The type equation is sometimes written as

(2) Y = AB^X

In this case we have used capital letters for A and B, the two
parameters, merely to make sure that the student understands
that they are not the same numerical values as those of Eq. (1).
In fact, if we start with Eq. (2) and take logarithms of both sides
we get

(3) log Y = log A + (log B)(X)

Since A and B are constants, it must also be true that their logarithms,
log A and log B, are constants. Let us write, then, log
A = a and log B = b. Substituting these in Eq. (3) gives us
immediately Eq. (1), so that we see, first, that the two equations
are equivalent and, second, that the values of a and b which we
derive if we use form (1) will be the logarithms of the values of A
and B which we derive if we use form (2).
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The semilogarithmic curve has the general form shown in Fig. 
12.9. If the type equation is written in form (1), we get the rising 
solid curve of Fig. 12.9 when the value of b is positive and the fall¬ 
ing broken curve when the value of b is negative. If the type 
equation is written in form (2), we get the rising solid curve when 
the value of B is greater than 1 and the falling negative broken 
curve when the value of B lies between zero and 1. Both forms 
of the curve are monotonic; that is, the curve either rises through¬ 
out its length or falls throughout 
its length, unlike the second- 
degree parabola which is positive 
in part and negative in part. 

When the type equation is 
written in form (2), the value of 
A is the value of Y at the time 
of origin, or when X = 0; while 
the value of B tells us indirectly 
the rate of increase or decrease. 

If the value of B is greater than 
1, the curve increases, and the 
percentage rate of increase can be 
found by subtracting 1 from the 
value of B. Thus, for example, if the equation were 

Y = 76(1.295)* 

we would know that Y has a value of 76 when X is zero and that 
Y increases 29.5 per cent each time X increases one unit. (If we 
subtract 1 from 1.295 we get 0.295, and 0.295 is 29.5 per cent just 
as 0.29 is 29/100 or 29 per cent.) If the value of B is smaller than 
1, we subtract the value of B from 1 and discover the percentage 
rate of decrease. Thus if the equation were 

Y = 16(0.94)^X 

we would subtract 0.94 from 1 and get 0.06, which would show that 
the curve was dropping 6 per cent each time X increased one unit. 
When X = 0 this curve has a height of 16. One can see easily 
from these facts that the semilogarithmic curve is one in which 
there is a constant percentage rate of increase or decrease. It 
shows the way in which a sum of money increases if put at compound 
interest and is therefore sometimes called the compound interest 
law, or the C.I.L. Also, since X appears as an exponent, it is 
called an exponential function. It is a very common and useful 
type of relationship, which is found in almost every science. 

Fig. 12.9. Typical semilogarithmic curves. 
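The rule for reading the percentage rate of change from B can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original text; the function name percent_rate is our own.

```python
def percent_rate(B):
    """Percentage change in Y for each unit increase in X when Y = A * B**X.

    A positive result is a rate of increase, a negative one a rate of decrease.
    """
    return (B - 1.0) * 100.0


# The two equations used as examples in the text.
for B in (1.295, 0.94):
    rate = percent_rate(B)
    direction = "increase" if rate > 0 else "decrease"
    print(f"B = {B}: {abs(rate):.1f} per cent {direction} per unit of X")
```

Subtracting 1 from B and subtracting B from 1 are the same operation apart from sign, which is why one function serves for both the rising and the falling curve.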

Let us illustrate the fitting of this curve with the data of Table 
12.3, which shows a number of observations of atmospheric pressure 
at various heights above the surface of the earth. 

Table 12.3.—Atmospheric Pressure at Various Distances above Sea Level 

   Height    Pressure       Height    Pressure 
   (miles)   (inches)       (miles)   (inches) 

     0.0       28.9           4.7        8.5 
     0.8       26.9           5.4       10.0 
     1.4       20.4           5.6        7.6 
     1.8       21.9           6.8        6.9 
     2.6       19.1           7.1        6.2 
     2.6       16.2           7.3        4.8 
     3.3       13.2           8.0        5.6 
     3.7       13.2           8.4        4.1 
     4.0       11.2           8.4        4.6 
     4.7       11.2           9.5        3.7 

Even a quick look at the table shows that the relationship is negative, 
with the pressure decreasing as the altitude increases. If we plot 
the data on a scatter diagram as in Fig. 12.10, we note that the 
points lie in a band which falls as we move from left to right; but 
the band is not a straight one. It falls more rapidly at first, and 
more slowly later. A broken freehand regression line has been 
drawn through the swarm of points to guide the eye. 

Fig. 12.10. Atmospheric pressures at various heights above sea level. 
The curve is drawn freehand. 

If we want to find the equation of this line by the method of 
selected points we must select two points from the freehand line, 
since our type equation has two parameters. Let us first use the 
type equation in its form (2), where 

Y = AB^X 

We should select one point near each end of the line. If we can 
select a point where X = zero, it will reduce our arithmetic, but 
it is not necessary, and should not be done if it involves any 
considerable extrapolation. Let us select points where X = 0 and 
where X = 10. For accurate reading of the height of the line we 
would prefer to have our scatter diagram plotted on finely divided 
cross-section paper using a large scale; but doing as well as we can 
from the rough diagram of Fig. 12.10 we estimate that when 
X = 0, Y = 29; and when X = 10, Y = 3. Substituting these 
values in our type equation we get the two observation equations 

29 = AB^0 and 3 = AB^10 


From the first equation we see that A = 29. Substituting this 
value of A in the second equation gives 3 = (29)B^10. Putting 
this in logarithmic form, we get 

log 3 = log 29 + 10(log B) 
0.477 = 1.462 + 10(log B) 
10(log B) = -0.985 
log B = -0.0985 
B = 0.797 


Rounding off this last figure, since we have used rough original 
readings from the chart for our original values, and cannot now 
expect three-place accuracy, we get 


Y = (29)(0.8)^X 

We reduced our work somewhat in this case by selecting one 
point where X was zero. Since this is not always possible, let 
us repeat the same problem using as selected points the values 
Y = 19 when X = 2; and Y = 7 when X = 7. These two pairs 
of values are also read very roughly from Fig. 12.10. Substituting 
then in our type equation gives 


19 = AB^2 and 7 = AB^7 


Dividing the second equation by the first we get 

7/19 = AB^7 / AB^2 = B^5 

Since B^5 = 7/19 = 0.368, we can write 

log 0.368 = 5 log B 
log B = 9.913 - 10 
B = 0.82 

And substituting this back we find that A = 28.3. Our equation 
is, then, 

Y = 28.3(0.82)^X 

We start with a value of 28.3 and fall off 18 per cent for each mile 
we rise; or, if we prefer the first equation which we worked out, we 
start with a pressure of 29 in. at sea level and the pressure drops 
20 per cent for each mile we rise. Both results are very rough on 
account of the carelessness with which our original values were 
read from the line. Yet the two results are consistent. 
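The method of selected points for the form Y = AB^X can be sketched in Python as follows. This is an illustration only, not part of the original text; the function name fit_two_points is our own, and the two pairs of points are the rough chart readings used above.

```python
def fit_two_points(x1, y1, x2, y2):
    """Return (A, B) for the curve Y = A * B**X passing through two points.

    Dividing the two observation equations eliminates A:
        y2 / y1 = B ** (x2 - x1)
    """
    B = (y2 / y1) ** (1.0 / (x2 - x1))
    A = y1 / B ** x1
    return A, B


# First pair of readings: X = 0, Y = 29 and X = 10, Y = 3.
A1, B1 = fit_two_points(0, 29, 10, 3)
# Second pair of readings: X = 2, Y = 19 and X = 7, Y = 7.
A2, B2 = fit_two_points(2, 19, 7, 7)

print(f"first pair:  A = {A1:.1f}, B = {B1:.3f}")
print(f"second pair: A = {A2:.1f}, B = {B2:.3f}")
```

Taking a root directly replaces the logarithm tables of the text; the results agree with the hand computation to the accuracy the chart readings allow.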

If we wish to compute an equation more carefully by the 
method of least squares, we need two normal equations, since we 
are seeking two parameters. This time we shall use form (1) of 
the type equation, and the necessary normal equations are 

Na + bΣX = Σ(log Y) 
aΣX + bΣX^2 = Σ(X log Y) 

The student will note that these are exactly like the normal 
equations for a straight line except that we have used log Y wherever 
we used Y before. This is natural, since our type equation is 
exactly the same as the type equation of the straight line except 
that log Y is substituted for Y. 





To get the values needed for the normal equations we make up 
a table like Table 12.4, where our first two columns give the 
original values of height and atmospheric pressure. Since, however, 
we shall need the logarithms of Y, they are given in the third 
column. We then proceed as usual with the other columns of 
the table, squaring the values of X, and also multiplying each 
value of log Y by the corresponding value of X. The sums of 
these columns are then substituted in the normal equations to 
give 

20a + 96.1b = 20.086 
96.1a + 609.95b = 81.937 

Solving these two equations we find that a = 1.477 and 
b = -0.0984. Our equation is, then, 

log Y = 1.477 - 0.0984X 

We can use this equation directly, if we wish, to estimate values 
of Y from known values of X. For example, if we want to estimate 
the atmospheric pressure at an altitude of 5 miles, we write 

log Y = 1.477 - 0.0984(5) = 1.477 - 0.492 = 0.985 
Y = 9.66 

Our estimate is that the pressure is 9.66 in. at an altitude of 5 
miles. If we want, we can convert our result to form (2), remembering 
that a = log A and that b = log B. Since we know now 
that a = 1.477 and that b = -0.0984, we merely take antilogarithms 
to get 

Y = (30)(0.797)^X 

We interpret these figures to mean that, while air pressures vary 
from many causes, the general underlying tendency is for the 
pressure to equal 30 in. at sea level, and to drop 20.3 per cent for 
every mile of altitude. 
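The least-squares computation for form (1) can be reproduced with a short Python sketch. It is an illustration under our own naming (fit_semilog, heights, pressures), not part of the original text. Because it uses full-precision logarithms rather than three-place table values, its results may differ from the hand computation in the last decimal place.

```python
import math

# Data of Table 12.3: height in miles (X) and pressure in inches (Y).
heights = [0.0, 0.8, 1.4, 1.8, 2.6, 2.6, 3.3, 3.7, 4.0, 4.7,
           4.7, 5.4, 5.6, 6.8, 7.1, 7.3, 8.0, 8.4, 8.4, 9.5]
pressures = [28.9, 26.9, 20.4, 21.9, 16.2, 19.1, 13.2, 13.2, 11.2, 11.2,
             8.5, 10.0, 7.6, 6.9, 6.2, 4.8, 5.6, 4.1, 4.6, 3.7]


def fit_semilog(xs, ys):
    """Solve the two normal equations for log Y = a + bX (common logarithms)."""
    n = len(xs)
    logs = [math.log10(y) for y in ys]
    sum_x = sum(xs)
    sum_log = sum(logs)
    sum_xx = sum(x * x for x in xs)
    sum_xlog = sum(x * l for x, l in zip(xs, logs))
    b = (n * sum_xlog - sum_x * sum_log) / (n * sum_xx - sum_x ** 2)
    a = (sum_log - b * sum_x) / n
    return a, b


a, b = fit_semilog(heights, pressures)
A, B = 10 ** a, 10 ** b              # conversion to form (2), Y = A * B**X
estimate_5_miles = 10 ** (a + b * 5)  # estimated pressure at 5 miles

print(f"a = {a:.3f}, b = {b:.4f}")
print(f"A = {A:.1f}, B = {B:.3f}")
print(f"pressure at 5 miles = {estimate_5_miles:.2f} in.")
```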

Table 12.4.—Computation of Least Squares Semilogarithmic Curve 
from Data of Table 12.3 

  Altitude    Pressure 
  (miles)     (inches) 
    (X)         (Y)       log Y       X^2      X log Y 

    0.0        28.9       1.461      0.00      0.000 
    0.8        26.9       1.430      0.64      1.144 
    1.4        20.4       1.310      1.96      1.834 
    1.8        21.9       1.340      3.24      2.412 
    2.6        16.2       1.210      6.76      3.146 
    2.6        19.1       1.281      6.76      3.331 
    3.3        13.2       1.121     10.89      3.699 
    3.7        13.2       1.121     13.69      4.148 
    4.0        11.2       1.049     16.00      4.196 
    4.7        11.2       1.049     22.09      4.930 
    4.7         8.5       0.929     22.09      4.366 
    5.4        10.0       1.000     29.16      5.400 
    5.6         7.6       0.881     31.36      4.934 
    6.8         6.9       0.839     46.24      5.705 
    7.1         6.2       0.792     50.41      5.623 
    7.3         4.8       0.681     53.29      4.971 
    8.0         5.6       0.748     64.00      5.984 
    8.4         4.1       0.613     70.56      5.149 
    8.4         4.6       0.663     70.56      5.569 
    9.5         3.7       0.568     90.25      5.396 

  Totals 96.1             20.086    609.95     81.937 

12.6. The Concept of Half-life.—A moment’s thought will 
show that if the air pressure has a value of 30 at sea level, and 
drops 20 per cent for each mile of altitude, it will never fall quite 
to zero. The semilogarithmic curve falls closer and closer to the 
base line, but never quite reaches it. If at each height we have a 
pressure which is still 80 per cent of the pressure a mile below, there 
is still some pressure left. Thus although the curve continues to 
drop, it drops more and more slowly, and we cannot say how high 
one would have to go to reach the top of the atmosphere. We can 
easily tell, however, how high we would have to go to pass through 
half the atmosphere, or, in other words, to reach the point where 
the air pressure was half that at the ground. Since Y = AB^X, 
we see that when X is zero, Y = A. We want to find the value 
of X when Y is half as large, or when Y = A/2. Since Y = AB^X, 
we can write AB^X = A/2 or B^X = 1/2. Consequently, 

X(log B) = log (1/2) = -0.30103 
X = -0.30103 / log B 

Since our least-squares estimate of B was 0.797, and our value of 
b has already told us that log B is -0.0984, we can write 

X = -0.30103 / -0.0984 = 3.06 

We estimate that the air pressure will be cut in half every 3.06 
miles that we rise. 

This concept is very commonly used in the physical sciences. 
For example, uranium slowly disintegrates into lead, and the 
disintegration follows the semilogarithmic law. It takes 66 million 
years for 1 per cent of the uranium to be transformed into lead, 
or, what amounts to the same thing, after a million years 99.9848 
per cent will still be uranium, while the other 0.0152 per cent will 
have been changed to lead. The rate of decomposition, then, is 

Y = 0.999848^X 

where Y is the proportion of the original uranium remaining and X 
is the time in millions of years. If a geologist analyzes a rock and 
finds the proportion of uranium and lead, he can tell from the 
formula how old the rock is. This is one of the ingenious methods 
developed for ascertaining the age of the earth's crust. Using our 
formula we can say that the half-life of uranium is 
-0.30103/log 0.999848 = 4555 million years. We might better round 
this off to 4600 million or 4.6 billion years. When the scientist 
says that uranium has a half-life of 4.6 billion years, the student 
is likely to wonder why he does not merely double the figure and say 
that it has a life of 9.2 billion years. The answer is that, if half 
of the uranium is converted each 4.6 billion years, there will be 
half of it left after 4.6 billion, a quarter of it left after another 
4.6 billion, an eighth of it left after the third 4.6 billion years, 
etc. It will never be entirely converted. Similarly, scientists have 
discovered that growing trees store up radiocarbon, which slowly 
disintegrates into ordinary carbon after the tree dies. Radiocarbon 
has a half-life of 5568 ± 30 years,¹ and by analyzing pieces of wood 
taken from ancient ruins the scientist can tell with reasonable 
accuracy when they were built. The rate of decomposition is 

Y = 0.99988^X 

where Y is the proportion of the original radiocarbon remaining and 
X is the number of years since the tree died. For example, when a 
piece of charcoal from Lascaux Cave near Montignac was studied, it 
was discovered that roughly 84 to 85 per cent of the radiocarbon 
had been converted to ordinary carbon. This indicated that the 
wood was between 15,000 and 16,000 years old. When our equation 
is in the form Y = AB^X, we find the half-life from the 
formula 

Half-life = -0.30103 / log B 

If we have been using the equation in the form log Y = a + bX, 
we find the half-life from the formula 

Half-life = -0.30103 / b 

¹ The figure 30 is the standard error of the determination. The student 
should understand what this means. The data are taken from J. R. Arnold 
and W. F. Libby, Radiocarbon Dates, Science, Vol. 113, No. 2927, Feb. 2, 
1951, p. 111. 


As a final example, what is the half-life of a function which 
decreases 10 per cent each year? Evidently at any time it will be 
90 per cent of its value a year earlier, or Y = A(0.9)^X, and the 
half-life is -0.30103/log 0.9 = -0.30103/-0.0458 = 6.57. The 
half-life is 6.57 years. After 6 years and 208 days the function 
will have dropped to half its original value, and each 6.57 years 
will see it cut in half again, slowly dwindling toward nothing but 
never getting there. 
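The two half-life formulas are easily put into Python. The sketch below is illustrative only, with function names of our own choosing; it uses full-precision logarithms, so for a value of B very close to 1 (as with uranium) the result can differ by a few units from a figure worked with rounded log tables.

```python
import math

LOG_HALF = math.log10(0.5)  # -0.30103 to five places


def half_life_from_B(B):
    """Half-life for the form Y = A * B**X, with 0 < B < 1."""
    return LOG_HALF / math.log10(B)


def half_life_from_b(b):
    """Half-life for the form log Y = a + bX, with b negative."""
    return LOG_HALF / b


yearly_decline = half_life_from_B(0.9)      # function falling 10 per cent a year
air_pressure = half_life_from_b(-0.0984)    # least-squares b for the pressure data
uranium = half_life_from_B(0.999848)        # in millions of years

print(f"10 per cent yearly decline: {yearly_decline:.2f} years")
print(f"air pressure: {air_pressure:.2f} miles")
print(f"uranium: about {uranium:.0f} million years")
```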

12.7. The Logarithmic Curve.—The type equation of this curve 
may also be written in two forms. If we put it in the form 
making it most easily comparable with our other curves, we write 
it 


(1) log Y = a + b(log X) 

This is like the formula for a straight line except for the fact that 
we have used logarithms of Y and of X where the straight line 
has values of Y and X themselves. This function is a curve when 
we plot values of Y and X, but it becomes linear when we plot 
values of log Y and log X. 

The type equation may alternatively be written 


(2) Y = AX^B 

Again we use capital letters for the parameters merely to make 
sure that the student does not suppose that they take the same 
values as a and b in form (1). If we take logarithms of both sides 
of (2), we get 

(3) log Y = log A + B log X 

which is the same as form (1) save that we have written log A for 
a and B for b. Evidently the a of form (1) is the logarithm of A 
in form (2), and the b of form (1) 
is identical with the B of form (2). 

Thus we can shift back and forth 
from the one form to the other if 
we wish. 

The logarithmic curve has the 
general form shown in Fig. 12.11. 

The curve takes the positive slope 
of the solid curve in the figure 
when the value of b is positive in 
form (1) or when the value of B is 
positive in form (2). When these 
values are negative, the curve is 
like the negatively sloping broken curve of Fig. 12.11. Both 
curves are monotonic, either rising or falling throughout their 
entire length. 

With a straight line, a given amount of change in X always 
brings a constant amount of change in Y. For example, every 
extra year of age adds 0.344 mm. to the blood pressure (see Sec. 
11.10). With a semilogarithmic curve, a given amount of change 
in X always brings a constant percentage change in Y. For 
example, every added mile of altitude decreases the atmospheric 
pressure by 20.3 per cent (see the last sentence of Sec. 12.5). 
But with the logarithmic curve, which we are now studying, a 
given percentage change in X always brings a constant percentage 
change in Y. With the straight line, both changes are in quan¬ 
tity or amount. With the logarithmic curve, both are in per¬ 
centage. With the semilogarithmic curve, X is in amounts and 
Y is in percentages. 

We shall illustrate the fitting of this curve with data showing 
the distances of the planets from the sun and their periods of 
revolution. The figures are in Table 12.5. Distances are in 
astronomical units, or in terms of the distance from the earth to 
the sun. Times are in years. It is difficult to make any 
satisfactory graph of these data, because of the extreme values of the 
variables. The largest distance is about 100 times the smallest 
distance, and the longest period about 1000 times the shortest 
period. We shall not try to show the data on a graph for the time 
being, although later we shall learn how the matter can be 
handled (see Fig. 12.19). Instead of using values from a freehand 
trend, which would be more accurate, we shall this time 
take values of X and Y directly from the table, letting X represent 
distance and Y represent period in years. 

Fig. 12.11. Typical logarithmic curves. 

Table 12.5.—Distances of Planets from Sun and Periods of Revolution 

                  Distance 
               (astronomical     Period 
  Planet           units)        (years) 

  Mercury           0.39           0.24 
  Venus             0.72           0.62 
  Earth             1.00           1.00 
  Mars              1.52           1.88 
  Jupiter           5.20          11.9 
  Saturn            9.54          29.5 
  Uranus           19.2           84 
  Neptune          30.1          165 
  Pluto            39.5          248 

With the method of selected points, using our type equation 
Y = AX^B, we substitute first the values X = 0.39 and Y = 0.24, 
and then the values X = 39.5 and Y = 248. This gives us the 
observation equations 

0.24 = A(0.39)^B 
248 = A(39.5)^B 

Dividing the second equation by the first, we get 

248/0.24 = A(39.5)^B / A(0.39)^B 
1032 = (101.2)^B 
log 1032 = B(log 101.2) 
3.01368 = 2.00518B 
B = 3.01368/2.00518 = 1.503 

Thus B has a value of almost exactly 3/2 or 1.5. Substituting 
this value back in our equation 248 = A(39.5)^B, we get 
248 = A(39.5)^(3/2). 

log 248 = log A + (3/2)(log 39.5) 
2.39445 = log A + (3/2)(1.59660) 
2.39445 = log A + 2.39490 
log A = -0.00045 

Or for practical purposes A is equal to 1. We can write our 
equation, then, 

Y = X^(3/2) or Y^2 = X^3 
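The selected-points computation for the form Y = AX^B can be sketched in Python. This is an illustration only; the function name fit_power_two_points is our own. Because the sketch keeps B unrounded when solving for A, its A comes out close to, but not exactly, 1.

```python
import math


def fit_power_two_points(x1, y1, x2, y2):
    """Return (A, B) for the curve Y = A * X**B passing through two points.

    Dividing the observation equations gives y2 / y1 = (x2 / x1) ** B,
    so B is a ratio of two logarithms.
    """
    B = math.log10(y2 / y1) / math.log10(x2 / x1)
    A = y1 / x1 ** B
    return A, B


# Mercury (0.39, 0.24) and Pluto (39.5, 248) from Table 12.5.
A_sel, B_sel = fit_power_two_points(0.39, 0.24, 39.5, 248)
print(f"A = {A_sel:.3f}, B = {B_sel:.3f}")
```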

If we prefer to use the method of least squares we use the two 
normal equations 


Na + bΣ(log X) = Σ(log Y) 
aΣ(log X) + bΣ(log X)^2 = Σ(log X)(log Y) 

These equations are again exactly like the normal equations for 
straight lines except that we have used log X instead of X and 
log Y instead of Y, which is what we would expect when we 
recall that the logarithmic function is linear in log X and log Y 
instead of in X and Y. 


Table 12.6.—Computation of Least Squares Line from Data of 
Table 12.5 with a Logarithmic Curve 

  Planet        X        Y       log X     log Y    (log X)^2   (log X)(log Y) 

  Mercury      0.39     0.24    -0.409    -0.620      0.167         0.254 
  Venus        0.72     0.62    -0.143    -0.208      0.016         0.030 
  Earth        1.00     1.00     0.000     0.000      0.000         0.000 
  Mars         1.52     1.88     0.182     0.274      0.033         0.050 
  Jupiter      5.20    11.9      0.716     1.076      0.513         0.770 
  Saturn       9.54    29.5      0.980     1.470      0.960         1.441 
  Uranus      19.2     84        1.283     1.924      1.646         2.468 
  Neptune     30.1    165        1.479     2.217      2.187         3.279 
  Pluto       39.5    248        1.597     2.394      2.550         3.823 

  Totals     107.17   542.14     5.685     8.527      8.072        12.115 


To get the values for substitution in our normal equations, we 
set up Table 12.6, where the first two columns are taken from 
Table 12.5 and the other columns are derived from them. We 
must remember that the logarithms of numbers between zero and 
1 (that is, of decimals) are negative, and in adding the columns 
of logarithms we must add algebraically, keeping track of signs. 
If we substitute the totals from the table in our normal equations, 
we get 

9a + 5.685b = 8.527 
5.685a + 8.072b = 12.115 

Solving these two equations tells us that a = 0.00067 and b = 
1.501. Our equation is, then, 

log Y = 0.00067 + 1.501(log X) 

Again if we round off, we get almost exactly 

log Y = 1.5(log X) 

Remembering that when we transfer to the alternative expression 
of the type equation where Y = AX^B, we have a = log A and 
b = B, we can now write 

Y = (1)(X)^(3/2) = X^(3/2) or Y^2 = X^3 

This time our answer by the method of least squares was for 
practical purposes identical with that found by selected points. 

We can now use this equation to estimate the period of a planet 
at any distance. For example, knowing that the distance of 
Jupiter is 5.2 astronomical units, what is its period? Y^2 = X^3, 
so we can write Y^2 = 5.2^3 = 140.6 and Y = 11.86, or 11.9 years. 
If a planet were discovered 50 astronomical units from the sun, its 
period would be 

Y^2 = 50^3 = 125,000 and Y = 353.6 years 
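The least-squares fit of the logarithmic curve to the planetary data can be sketched in Python as follows. The names fit_loglog and predicted_period are our own, and the sketch uses full-precision logarithms, so a and b may differ slightly in the last places from values worked with three-place tables.

```python
import math

# Table 12.5: distance in astronomical units (X) and period in years (Y).
distance = [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.2, 30.1, 39.5]
period = [0.24, 0.62, 1.00, 1.88, 11.9, 29.5, 84, 165, 248]


def fit_loglog(xs, ys):
    """Solve the normal equations for log Y = a + b(log X)."""
    n = len(xs)
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    s_lx, s_ly = sum(lx), sum(ly)
    s_lxlx = sum(u * u for u in lx)
    s_lxly = sum(u * v for u, v in zip(lx, ly))
    b = (n * s_lxly - s_lx * s_ly) / (n * s_lxlx - s_lx ** 2)
    a = (s_ly - b * s_lx) / n
    return a, b


def predicted_period(x, a, b):
    """Estimated period at distance x, using Y = 10**a * x**b."""
    return 10 ** a * x ** b


a_ll, b_ll = fit_loglog(distance, period)
period_at_50 = predicted_period(50, a_ll, b_ll)

print(f"a = {a_ll:.4f}, b = {b_ll:.3f}")
print(f"estimated period at 50 astronomical units: {period_at_50:.0f} years")
```

The estimate at 50 units uses the fitted b rather than the rounded value 1.5, so it will not coincide exactly with the figure obtained from Y^2 = X^3.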

12.8. The Reciprocal Curve.—The last of the families of curves 
which we shall consider has the type equation 

1/Y = a + bX 

It has the general form shown in Fig. 12.12. When the value of 
b in the type equation is positive, the curve has a negative slope 
like that of the solid curve in the diagram; when b is negative, the 
slope of the curve is positive like that of the broken curve in the 
diagram. It is evident that the type equation is similar to that 
for the straight line except that we have used 1/Y instead of Y. 
While this function yields a curve when we plot values of X and 
Y, it yields a straight line if we plot values of X and 1/Y. 


We shall fit this curve to the data of Table 12.7, which shows 
the potato production in certain states and the price of 
potatoes at two Middle Western markets from 1906 to 1918. A 
scatter diagram of these values, with a freehand regression line, 
appears as Fig. 12.3 on page 342. As usual, we read two 
points from our line, selecting points near the ends. Reading 
roughly from the diagram we estimate that when X = 2.0, 
Y = 4.00; and when X = 3.6, Y = 1.25. Substituting these 
values in our type equation 1/Y = a + bX, we get 

1/4.00 = a + 2.0b 
1/1.25 = a + 3.6b 

Solving these for a and b, we get a = -0.438 and b = 0.344. Our 
reciprocal equation is, then, 1/Y = -0.438 + 0.344X. 

To fit the curve to these data by the method of least squares, 
we use the two normal equations 

Na + bΣX = Σ(1/Y) 
aΣX + bΣX^2 = Σ(X/Y) 

To compute the necessary values we put our data into the form 
of Table 12.8. Instead of using the values of Y, or price, we take 
their reciprocals, and list the values of 1/Y. We substitute the 
totals from the table in our normal equations to get 

13a + 36.9b = 6.9635 
36.9a + 106.83b = 20.488 

Table 12.7.—Potato Production in 27 Late-crop States and Potato 
Prices at Minneapolis and St. Paul, 1906-1918 

            Production 
           (100 million     Price ($ per 
  Year       bushels)         200 lb.) 

  1906         2.7              1.6 
  1907         2.6              2.0 
  1908         2.3              2.6 
  1909         3.2              1.2 
  1910         2.7              2.0 
  1911         2.5              3.1 
  1912         3.5              1.3 
  1913         2.7              2.1 
  1914         3.5              1.4 
  1915         2.8              2.0 
  1916         2.2              3.9 
  1917         3.2              1.9 
  1918         3.0              1.8 

Solving these two equations simultaneously, we get 
a = -0.445 and b = 0.346. 

Our regression equation is 

1/Y = -0.445 + 0.346X 

Now let us estimate the most probable price for a production of 
280 million bushels. This production makes X equal 2.8, which 
gives us 

1/Y = -0.445 + 0.346(2.8) = 0.524 
Y = 1.91 

Our estimated price is, then, $1.91 per 200 lb. As before, we can 
estimate the price for several conveniently spaced productions, 
locate these estimates on our scatter diagram, and draw a smooth 
curve through the estimated points. This will give us a picture 
of our reciprocal regression line. Such a curve appears in Fig. 
12.13. 
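The reciprocal fit can be checked with a short Python sketch. It is illustrative only; the names fit_reciprocal and estimate_price are our own, and the data are those of Table 12.7.

```python
# Table 12.7: production in hundreds of millions of bushels (X)
# and price in dollars per 200 lb. (Y), 1906 to 1918.
production = [2.7, 2.6, 2.3, 3.2, 2.7, 2.5, 3.5, 2.7, 3.5, 2.8, 2.2, 3.2, 3.0]
price = [1.6, 2.0, 2.6, 1.2, 2.0, 3.1, 1.3, 2.1, 1.4, 2.0, 3.9, 1.9, 1.8]


def fit_reciprocal(xs, ys):
    """Solve the normal equations for 1/Y = a + bX."""
    n = len(xs)
    recips = [1.0 / y for y in ys]
    s_x = sum(xs)
    s_r = sum(recips)
    s_xx = sum(x * x for x in xs)
    s_xr = sum(x * r for x, r in zip(xs, recips))
    b = (n * s_xr - s_x * s_r) / (n * s_xx - s_x ** 2)
    a = (s_r - b * s_x) / n
    return a, b


def estimate_price(x, a, b):
    """Estimated price at production x; the fitted quantity is 1/Y."""
    return 1.0 / (a + b * x)


a_rec, b_rec = fit_reciprocal(production, price)
price_at_280 = estimate_price(2.8, a_rec, b_rec)

print(f"a = {a_rec:.3f}, b = {b_rec:.3f}")
print(f"estimated price at 280 million bushels: ${price_at_280:.2f}")
```

Evaluating estimate_price at several evenly spaced productions gives the points through which the reciprocal regression line can be drawn.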

The student who wishes practice on curve fitting may be interested 
to fit a second-degree parabola also to the data of Table 
12.7. By the method of least squares he will get as his regression 
line Y = 18.9 - 10.415X + 1.549X^2. Recalling the rules of 
Sec. 12.3 he will realize at once that the curve is concave upward, 
as it should be. The curve will also pass through a minimum 
point where X = 10.415/3.098 = 3.36. This means that a 
second-degree parabola will indicate a minimum price when the 
production is 336 million bushels. The curve will indicate that, 
at higher productions than this, the price will rise as the 
production increases. The fact that the curve has been fitted by 
mathematical methods does not make this result sensible or the 
estimates trustworthy. It is quite as possible to fit a second-degree 
parabola to the data of Table 12.7 as it is to fit a reciprocal curve. 
In fact, one could fit all the curves which we have described, and 
many more. But what we are after is a description of what 
really happens, and the line which describes the facts is actually 
the line of “best fit.” No amount of slavish following of 
mathematical formulas can substitute for common sense and 
understanding in statistical work. 

Fig. 12.13. Late-crop potato production and Twin Cities price, 1906-1918, 
with reciprocal regression line. Horizontal axis: production, hundreds 
of millions of bushels. Data from page 368. 

12.9. Deciding the Type of Curve to Fit.—We have left until 
last one of the problems which the practicing statistician actually 
faces early in the game, namely, how can we decide which of the 
families of curves to fit to our data? Given a scatter diagram, 
for example, which shows a rough tendency for a negative 
curvilinear relationship which is concave upward, how can we tell 
whether to fit a reciprocal curve, a second-degree parabola, a 
logarithmic curve, or a semilogarithmic curve? Our choice 
will have to be in large part subjective; yet if we choose the 
wrong one, no amount of objective and careful labor can yield 
results which are as useful as they might be. 


Table 12.8.—Fitting the Reciprocal Regression Line to the Data of 
Table 12.7 

                           Reciprocal 
           Production       of Price 
  Year        (X)            (1/Y)         X^2         X/Y 

  1906        2.7            0.6250        7.29      1.68750 
  1907        2.6            0.5000        6.76      1.30000 
  1908        2.3            0.3846        5.29      0.88458 
  1909        3.2            0.8333       10.24      2.66667 
  1910        2.7            0.5000        7.29      1.35000 
  1911        2.5            0.3226        6.25      0.80650 
  1912        3.5            0.7692       12.25      2.69220 
  1913        2.7            0.4762        7.29      1.28574 
  1914        3.5            0.7143       12.25      2.50000 
  1915        2.8            0.5000        7.84      1.40000 
  1916        2.2            0.2564        4.84      0.56408 
  1917        3.2            0.5263       10.24      1.68416 
  1918        3.0            0.5556        9.00      1.66666 

  Totals     36.9            6.9635      106.83     20.48809 


Sometimes, to be sure, our knowledge of the subject matter 
with which we are working leads us to believe that a particular 
sort of relationship is logical. For example, the student of 
economics who is working with compound interest may decide 
even without looking at his data that they should increase by a 
constant percentage rate. In such a case he should fit a semi- 
logarithmic curve. Similarly Newton, starting with Kepler's 
laws, reasoned that there must theoretically be a certain kind of 
relationship between the attraction of bodies for each other, their 
masses, and the distance between them. His great universal 
law of gravitation had to take a definite mathematical form to 
jibe with principles already known. But it is very uncommon 
for the student to have such assistance. In almost every case he 
will be entirely devoid of any a priori idea as to the nature of the 
relationship, and he will have to rely on other indications and 
tests which we shall now enumerate in Secs. 12.10 to 12.13. 

12.10. Comparison with Type Curves.—The first step in 
studying any relationship is to plot the data on a scatter diagram 
and inspect them. If the number of cases is large, it may help 
to divide them into classes according to the value of X and find 
the average value of Y within each class. These averages, when 
plotted on a scattergram, will often show a more regular 
pattern than do the individual observations. Having studied the 
arrangement of the dots, the statistician proceeds to draw carefully 
a freehand curve through them, indicating as well as he can 
the general trend or pattern. This curve he then compares with 
type curves for the various families—curves similar to those of 
Figs. 12.4, 12.8, 12.9, 12.11, and 12.12. Until he has had a 
good deal of experience with such curves, it will pay him to draw 
large-scale type curves of his own, starting with the type equation 
and substituting values of X and Y to get the data for his original 
diagram. This comparison is not final, but may well help him 
right at the start to exclude certain curves, and reduce the number 
of possibilities to two or three. 

12.11. The Process of Differencing.—Having made a scatter 
diagram, and having drawn thereon a careful freehand regression 
or trend line, we may next use the line to estimate values of Y at 
equally spaced values of X. To translate this into other wording, 
we may erect perpendiculars at evenly spaced points along the 
base line, and read the heights at which they cut the curve. 
Thus if we have the curve of Fig. 12.14 we might read the heights 
of the curve at the points where X = 0, 2, 4, 6, 8, 10, 12, 14, and 
16, or at any other equally spaced X values. If we do read the 
heights of the curve in the figure (and naturally our readings 
would be more accurate if taken from a large-scale diagram on 
cross-section paper than they can be from a small rough diagram 
in a textbook), we get the values in the first two columns of 
Table 12.9. Each figure in the third column, labeled ΔY, is a 
difference between two figures in the preceding column. We get 
any figure in the ΔY column by subtracting from the figure at its 
left the figure next above in the column at the left. For example, 
the first item in the ΔY column is 3.6. It was found by 
subtracting 8.0 from 11.6, both of these figures appearing in the 
second column. Similarly the second value in the ΔY column, 
2.8, was found by subtracting 11.6 from 14.4, both of which are 
in the second column. Thus we get the column of first 
differences. The expression “ΔY” does not mean Δ times Y, as one 
might expect from algebra. The Δ is the Greek letter D, and 
stands for “difference,” so that we may read the expression ΔY to 
mean “difference in Y.” 

Table 12.9.—Differencing Data from Fig. 12.14 

    X        Y        ΔY       Δ^2 Y 

    0       8.0 
    2      11.6       3.6 
    4      14.4       2.8      -0.8 
    6      16.4       2.0      -0.8 
    8      17.6       1.2      -0.8 
   10      18.0       0.4      -0.8 
   12      17.6      -0.4      -0.8 
   14      16.4      -1.2      -0.8 
   16      14.4      -2.0      -0.8 

Having now obtained the third column, we go through the 
differencing process over again to get the fourth and last column, 
headed Δ^2 Y. This does not give the squares of the differences, 
as might be supposed, but is, rather, the “second differences,” 
or differences between the differences. Thus if we subtract the 
first number in column three from the second number we get 
2.8 - 3.6 = -0.8, which is the first entry in the fourth column. 
If we subtract the second number from the third we get 

2.0 - 2.8 = -0.8 

which is the second number in the fourth column. In fact, on 
this particular occasion the second differences are constant, all 
equal to -0.8. Such an equality of second differences would not 
be likely to happen by chance, and in Table 12.9 it happened 
because the curve of Fig. 12.14 follows a definite mathematical 
law. But for the moment we are interested in understanding 
the process of differencing. If we now took still another step, 
and took the differences between the second differences, we 
would get a column of third differences labeled Δ^3 Y, etc. 

Note that we are taking values, not from an original scatter 
diagram, but from a freehand curve drawn upon such a diagram. 
Since this is true we can easily get evenly spaced values of X 
even if our original data were not evenly spaced. In time series 
we usually have such evenly spaced values anyway, and students 
often make the error of taking differences between the successive 
values of Y originally furnished them. One should take differ¬ 
ences between Y values read from the curve, and since it is just 
as easy to select these values at equally spaced points along the 
base line, and since such equal spacing reduces our arithmetic 
and simplifies our rules, we shall continue our discussion with the 
assumption that the values of Y which are being differenced are 
taken from equally spaced values of X. 
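The differencing process is mechanical and is easily sketched in Python. The function name differences is our own; the Y values are those read at X = 0, 2, ..., 16 in Table 12.9, and results are rounded only to suppress floating-point noise.

```python
def differences(values):
    """Each entry minus the entry before it (one column of differences)."""
    return [round(later - earlier, 10)
            for earlier, later in zip(values, values[1:])]


# Heights of the curve of Fig. 12.14 at equally spaced values of X.
y_values = [8.0, 11.6, 14.4, 16.4, 17.6, 18.0, 17.6, 16.4, 14.4]

first_differences = differences(y_values)            # the ΔY column
second_differences = differences(first_differences)  # the second-difference column

print("first differences: ", first_differences)
print("second differences:", second_differences)
```

Each pass shortens the list by one entry, which is why the difference columns of the table begin one row lower each time.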

Now that we understand differences, let us lay down a few 
simple rules. 

1. If the first differences in Y are constant, the line is a straight line. 

2. If the second differences in Y are constant, the line is a second-degree 
parabola. (This is the case of Table 12.9, whose values were computed 
from the equation Y = 8 + 2X - 0.1X^2.) 

3. If the third differences of Y are constant, the line is a third-degree 
parabola. 

4. If the nth differences of Y are constant, the line is an nth-degree 
parabola. 

5. If we find the logs of the Y values, and take the differences in these 
logarithms, and if Δ log Y is constant, the line is a semilogarithmic curve. 

6. If we find the reciprocals of the Y values, and take the differences in 
these reciprocals, and if Δ(1/Y) is constant, the line is a reciprocal curve. 

7. The rule for the logarithmic curve is more complicated. Take a 
series of X values and the corresponding Y values. Find the logs of the 
X values and the logs of the Y values. Find the differences in the logs of 
X and the differences in the logs of Y. Then divide the values of Δ(log X) 
by the corresponding values of Δ(log Y). If Δ(log X)/Δ(log Y) is constant, 
the curve is a logarithmic curve. This is complicated enough to require 
illustration, and we illustrate it in Table 12.10 with the data of Table 12.6. 


Table 12.10.—Recognition of Logarithmic Relationship by Ratio of
Differences between Logarithms of X and of Y

              X       Y     log X    log Y   Δ log X  Δ log Y  Δ log X/Δ log Y

Mercury      0.39    0.24   -0.409   -0.620
Venus        0.72    0.62   -0.143   -0.208   0.266    0.412       0.645
Earth        1.00    1.00    0.000    0.000   0.143    0.208       0.688
Mars         1.52    1.88    0.182    0.274   0.182    0.274       0.664
Jupiter      5.20   11.9     0.716    1.076   0.534    0.802       0.666
Saturn       9.54   29.5     0.980    1.470   0.264    0.394       0.671
Uranus      19.2    84       1.283    1.924   0.303    0.454       0.668
Neptune     30.1   165       1.479    2.217   0.196    0.293       0.669
Pluto       39.5   248       1.597    2.394   0.118    0.177       0.667


The first two numerical columns give the original values of X
and Y. Then come two columns of the logarithms of these
numbers. Then come two columns of differences in the logarithms.
Finally each figure in the fifth column is divided by the
corresponding figure in the sixth column to get the entry in the
final column.
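The computation behind Table 12.10 can be sketched as follows. The X and Y values are those of the table; the variable names are ours, and the logarithms are computed to full precision rather than to three places, so the ratios may differ from the printed ones in the third decimal.

```python
import math

# X and Y for the nine planets, Mercury through Pluto, from Table 12.10.
x = [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.2, 30.1, 39.5]
y = [0.24, 0.62, 1.00, 1.88, 11.9, 29.5, 84, 165, 248]

log_x = [math.log10(v) for v in x]
log_y = [math.log10(v) for v in y]

# Differences between successive logarithms.
d_log_x = [b - a for a, b in zip(log_x, log_x[1:])]
d_log_y = [b - a for a, b in zip(log_y, log_y[1:])]

# Ratio of differences; near-constant values suggest a logarithmic curve.
ratios = [dx / dy for dx, dy in zip(d_log_x, d_log_y)]

for r in ratios:
    print(round(r, 3))
```

All eight ratios lie close to two-thirds, which is the near-constancy that rule 7 asks for.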

Since our original curve was drawn freehand, and since no two 
men would draw it in exactly the same place, and since it is sub¬ 
ject to errors in reading, we cannot expect the rules on differ¬ 
encing to work out with complete accuracy. We do not expect 
to get the complete equality of differences which appeared in the 
final column of Table 12.9 except in those cases where the original 
data were computed for illustrative purposes. We really expect 
to get only approximate equality, often not even so good as that 
in the last column of Table 12.10. While we do not expect to 
find all the differences exactly the same, we do insist that they
should show a random pattern—that they should not get slowly 
larger and larger or smaller and smaller. If they are approxi¬ 
mately the same size, and show no regular progression of values, 
then we feel that the particular curve type is worth experimenting 
with. The complete equality of second differences in Table 12.9 
would make us certain that the curve was a second-degree parab¬ 
ola. The very close agreement of the ratio of differences in 
logarithms in Table 12.10 would make us confident that the 
original curve was logarithmic (and, to be sure, we did fit a 
logarithmic curve to this body of data in Sec. 12.7). 

Taking differences between the logarithms of numbers is, of 
course, analogous to finding the quotient of the numbers. We 
can restate our rule for the semilogarithmic curve (rule 5 above), 
therefore, by noting that if the X values are evenly spaced, we 
can divide each Y value by the one which follows, and if the 
quotients are constant or approximately so, the relationship is 
semilogarithmic. Thus, if we have evenly spaced X values, and 
our Y values are 20.0, 23.0, 26.5, 30.5, 35.1, 40.3, and 46.4, we 
can divide each number by the one which follows, and note that 
in each case our quotient is approximately 0.87. This approxi¬ 
mate equality of quotients indicates that the Y values form a 
geometric progression, and can be fitted by a semilogarithmic 
curve. With this semilogarithmic distribution some authorities 
would urge that we should use the geometric mean in preference 
to the arithmetic mean. 
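The quotient test of the preceding paragraph is easily checked. The sketch below uses the seven Y values quoted in the text; the variable names are ours.

```python
# Y values at evenly spaced X values, as quoted in the text.
ys = [20.0, 23.0, 26.5, 30.5, 35.1, 40.3, 46.4]

# Divide each Y value by the one which follows it.
quotients = [a / b for a, b in zip(ys, ys[1:])]

for q in quotients:
    print(round(q, 3))
```

Each quotient rounds to 0.87, so the Y values are approximately in geometric progression.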

12.12. Plotting “Distorted Data.”—When we plot our scatter
diagram we can see easily in most cases where there is a close
relationship whether it is linear or curvilinear. If it is linear, we fit
a line of the general form Y = a + bX. But if it is curvilinear,
and if we think we know what its family might be, we can test
it by plotting an entirely new scatter diagram in which, instead
of plotting values of X and Y, we “distort” one or the other of
them by converting it to another form before plotting. Thus
instead of using values of X and Y, we may plot values of log X
or of log Y or of 1/Y. This is useful in the following cases:

1. If we get a straight line when we plot X and log Y, the semilogarithmic
curve should be used.

2. If we get a straight line when we plot log X and log Y, the logarithmic
curve should be used.

3. If we get a straight line when we plot X and 1/Y, the reciprocal curve
should be used.
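The reason these three rules work can be written out briefly. In the sketch below the primed symbols X′ and Y′ are our own notation for the “distorted” values actually plotted; the curve forms are those used in this chapter. In each case the distorted values satisfy the equation of a straight line.

```latex
\begin{aligned}
\text{semilogarithmic:}\quad & \log Y = a + bX
  && \Rightarrow\; Y' = a + bX, && Y' = \log Y \\
\text{logarithmic:}\quad & \log Y = a + b\log X
  && \Rightarrow\; Y' = a + bX', && Y' = \log Y,\; X' = \log X \\
\text{reciprocal:}\quad & \frac{1}{Y} = a + bX
  && \Rightarrow\; Y' = a + bX, && Y' = \frac{1}{Y}
\end{aligned}
```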

Let us illustrate with the values of Fig. 12.10. It is apparent
that the original values of X and Y, which are plotted there, yield
a curve and not a straight line. But if we plot instead the values
of X and of log Y (these values appear in Table 12.4 on page 360)
we get Fig. 12.15. Now the curve has disappeared, and the dots
seem to be arranged in a straight line. From rule 1 above we
know that the relationship must be semilogarithmic.



Fig. 12.15. Data on altitude and air pressure “distorted” to give a straight
line. Compare with Fig. 12.10 on page 357. Data from Table 12.4 on
page 360.

12.13. Plotting Data on Distorted Axes.—Instead of “distort¬ 
ing” the data by converting them to logarithms or reciprocals, 
we may plot the original data on special forms of graph paper. 
Instead of using the ordinary cross-section paper, on which the 
lines are spaced equally, we use graph paper which has been 
especially prepared so that the lines are spaced, not in proportion 
to the numbers, but in proportion to the logarithms or the 
reciprocals of the numbers. 

With ordinary graph paper we might lay off an axis 5 in. long,
and divide it into 10 subdivisions by points located every half
inch along the line. The divisions would be equal. But in
contrast, if we want a logarithmic scale, we lay off a line which
is 100 units long (this might, of course, again be a line 5 in. long,
in which case there would be 100 units, each of 0.05 in.) and mark
off divisions at the distances which represent the logarithms of
the numbers. For example, since the logarithm of 1 is zero, we
put the figure 1 at the zero point on the scale. Since log 2 is
.301, we put the figure 2 at a point 30.1 units along the scale.
Since log 3 is .477, we put the figure 3 at a point 47.7 units along
the scale. And since the log of 10 is 1.000, we put the figure 10
at a point 100.0 units along the scale, which is at the extreme
far end of our original 100 units. Our scale is now divided
logarithmically rather than arithmetically. It is the same scale
of divisions which we find on a slide rule.

Fig. 12.16. Semilogarithmic paper.
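The construction of the logarithmic scale can be sketched as follows. The 100-unit length is the one used in the text; the function name `log_scale_position` is ours.

```python
import math

SCALE_UNITS = 100  # length of the axis in units, as in the text


def log_scale_position(n):
    """Distance along the scale at which the figure n is to be written."""
    return SCALE_UNITS * math.log10(n)


# Positions of the figures 1 through 10, rounded to a tenth of a unit.
positions = {n: round(log_scale_position(n), 1) for n in range(1, 11)}

for n, p in positions.items():
    print(n, p)
```

The figures 1, 2, 3, and 10 fall at 0.0, 30.1, 47.7, and 100.0 units, as stated above.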

If we want a reciprocal scale we proceed in a similar manner, 
placing our numbered divisions at the points which correspond 
to the reciprocals of the numbers. Figures 12.16 to 12.18 show 
the three most common of these forms of “distorted axes,” where 
one axis or both has been warped out of its usual form. Figure 
12.16 has the usual arithmetical axis horizontally. The vertical 
lines are equally spaced. But the horizontal lines are spaced 
unevenly, with their distances being proportionate to the logarithms
of the numbers. Paper with such scales can be bought ready prepared
from many stationers, or it can easily be prepared by the student
himself who has read the preceding paragraphs.

These “distorted axes” are useful in recognizing certain of the 
curves which we have studied in this chapter. If we think that 
one of these curves is indicated, we plot our original data on the 
appropriate form of graph paper, and if the dots now fall along 



Distance in astronomical units 

Fig. 12.19. Data plotted on logarithmic paper, indicating by their
straight-line arrangement that the function is logarithmic.

a straight line, or in a straight band, we know that we have dis¬ 
covered the nature of the underlying law in accordance with the 
following rules: 

1. Use a semilogarithmic curve if the data form a straight line on semi- 
logarithmic paper. 

2. Use a logarithmic curve if the data form a straight line on logarithmic 
paper. 

3. Use a reciprocal curve if the data form a straight line on reciprocal 
paper. 

These rules are illustrated with Fig. 12.19, where the data of
Table 12.5 are plotted on logarithmic scales. While we found 
it difficult to plot these data at all in Sec. 12.7, we now manage 
easily to compress them within the limits of a diagram, and we 
see at once that the points lie along a straight line rather than 
along a curve. Since the data form a straight line when both 
scales are logarithmic, we know that we should fit a logarithmic 
curve, as we did in Sec. 12.7. 

12.14. Summary.—The scientist is often interested in discovering
the existence of relationships, and, where they exist, in
determining their nature and deciding how to describe them. 
To this end he follows certain steps which aid him in deciding 
which of several possible simple mathematical laws he can use 
to give a reasonably satisfactory description of the general pattern 
of relationship. These steps are as follows: 

1. If there is some law which we know on a priori theoretical grounds
should govern the data, we use the appropriate mathematical equation.
If not (and there usually is no such theoretical basis) we proceed with the
following steps.

2. We plot the original data on a scatter diagram and draw a careful
freehand curve to show the general pattern.

3. We compare this curve with standard type curves for the various
curve types, such as those of Figs. 12.4, 12.8, 12.9, 12.11, and 12.12. While
this will seldom tell us with any certainty what curve to fit, it will often
help us to eliminate some curves from consideration and to narrow down
the possibilities to two or three curves. We then try the appropriate ones
of the following tests:

a. Differencing with equally spaced X values.

(1) If first differences of Y are constant, use a straight line of form
Y = a + bX.

(2) If second differences of Y are constant, use a second-degree
parabola of form Y = a + bX + cX².

(3) If third differences of Y are constant, use a third-degree parabola
of form Y = a + bX + cX² + dX³.

(4) If nth differences are constant, use an nth-degree parabola.

(5) If first differences in log Y are constant, use a semilogarithmic
curve of form log Y = a + bX.

(6) If first differences in (1/Y) are constant, use a reciprocal curve
of form 1/Y = a + bX.

(7) If values of Δ log X/Δ log Y are constant, use a logarithmic curve
of form log Y = a + b log X.

b. Plot distorted data on ordinary graph paper.

(1) If a chart of values of X and log Y gives a straight pattern, use a
semilogarithmic curve.

(2) If a chart of values of log X and log Y gives a straight pattern,
use a logarithmic curve.

(3) If a chart of values of X and 1/Y gives a straight pattern, use a
reciprocal curve.

c. Plot original data on distorted axes.

(1) If this yields a straight pattern on semilogarithmic paper, use a
semilogarithmic curve.

(2) If this yields a straight pattern on logarithmic paper, use a
logarithmic curve.

(3) If this yields a straight pattern on reciprocal paper, use a reciprocal
curve.

4. Having found the appropriate type of curve from these tests, we “fit”
the curve, finding the parameters for our own special problem, as follows:

a. By the method of selected points.

(1) Use as many points as there are parameters in the desired type
equation.

b. By the method of least squares.

(1) There will be as many normal equations as the number of
parameters in the desired equation.

5. Having found the equation, we can often interpret the parameters in it. 
For example, they may tell us the amount of change in Y for each unit 
change in X , or the percentage rate of change in Y for each unit change in X , 
or the percentage rate of change in Y for each 1 per cent change in X, etc. 

6. The equation can now be plotted on a scatter diagram, by computing 
the values of Y which correspond to certain selected values of X, and plotting 
these X and Y values on the diagram. 
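Steps 4 to 6 can be sketched on the planetary data of Table 12.10. The sketch below fits the logarithmic form log Y = a + b log X by least squares, solving the two normal equations for a straight line in the logarithms. It is an illustration under our own naming (`u`, `v`, `predict`), not a reproduction of the computation in Sec. 12.7.

```python
import math

# Distance X (astronomical units) and period Y (years), from Table 12.10.
x = [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.2, 30.1, 39.5]
y = [0.24, 0.62, 1.00, 1.88, 11.9, 29.5, 84, 165, 248]

# Work with the logarithms: u = log X, v = log Y.
u = [math.log10(value) for value in x]
v = [math.log10(value) for value in y]

n = len(u)
sum_u = sum(u)
sum_v = sum(v)
sum_uu = sum(a * a for a in u)
sum_uv = sum(a * b for a, b in zip(u, v))

# Normal equations for v = a + b*u (two parameters, two equations):
#   sum_v  = n*a     + b*sum_u
#   sum_uv = a*sum_u + b*sum_uu
b = (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)
a = (sum_v - b * sum_u) / n


def predict(x_value):
    """Value of Y on the fitted curve for a chosen X (step 6)."""
    return 10 ** (a + b * math.log10(x_value))


print(round(a, 3), round(b, 3))
print(round(predict(5.20), 1))
```

The slope b comes out very near 1.5 and the constant a very near zero. Read as in step 5, this says that Y changes by about 1.5 per cent for each 1 per cent change in X.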

The rules which we have just given are those which are used 
in practice to find scientific laws. When one finds in a textbook 
a formula for the law of gravitation, or the law of falling bodies, 
or the law of the pendulum, or Hooke’s law, or Kepler’s laws, 
etc., he may wonder how anyone ever happened to find these 
involved relationships. For example, Kepler’s third law states 
that “the squares of the periods of revolution of the planets about 
the sun are in the same ratio as the cubes of their mean distances.” 
The student may be convinced that the statement is true, and 
still wonder how anyone ever happened to stumble on it. The 
answer is that the discovery of this law was not a matter of 
chance. Having observed the data of Table 12.5 one merely 
follows the methods summarized in this section and finally discovers,
as we did in Sec. 12.7, that Y² = X³ when Y is the period
in years and X is the distance in astronomical units. He sees 
immediately that the squares of the periods vary with the cubes 
of the distances, which is Kepler’s law. Similarly, in this and 
in the preceding chapter, we have gone through the steps which 
would be necessary for establishing several other “laws,”
although it is true that in many of the cases we have not covered 
large enough bodies of data to be very sure of our results. Still, 
the method has been there, and if we had had data enough we 
could have added to our “discoveries” the laws that 

1. Between the ages of 15 and 65, the blood pressure is normally equal 
to 114 plus one-third the age (see Sec. 11.9). 

2. Atmospheric pressure, which is about 30 in. at sea level, tends to drop 
about 21.3 per cent for every mile of altitude, and to be cut in half for each 
rise of 3.06 mile (see Secs. 12.5 and 12.6). 

3. Between 1908 and 1918 the price of potatoes in the Minneapolis-
St. Paul market varied inversely with the volume of production in 27
late-crop states according to the general formula

1/Y = −0.445 + 0.346X

(see Sec. 12.8). 

4. Between 1906 and 1928 the production of crude petroleum in the 
United States tended to rise, slowly at first and then more and more rapidly, 
in accordance with the general formula 

Y = 366.2 + 18.5X + 0.423X²
Origin 1917.5; deviations in half years 

(see Sec. 12.3). 

12.15. Suggestions for Further Reading.—The general problem of fitting
simple curves is well treated by G. U. Yule and M. G. Kendall in “An
Introduction to the Theory of Statistics,” 11th ed., Chap. 17, Charles
Griffin & Co., Ltd., London, 1937. There is some discussion of the reasoning
lying behind various sorts of trend and regression formulas in J. G. Smith
and A. J. Duncan, “Elementary Statistics and Applications,” Chap. 21,
McGraw-Hill Book Company, Inc., New York, 1944. “Statistical Methods
Applied to Economics and Business,” by F. C. Mills, Henry Holt and
Company, Inc., New York, revised 1938, Chap. 7, has a good and simple
discussion of the simpler curves. Chapter 7 of John F. Kenney’s
“Mathematics of Statistics,” Part I, D. Van Nostrand Company, Inc., New York,
1939, is excellent but mathematical. It includes some discussion of the
Gompertz curve, useful in describing population statistics, in addition to
the curves which we have described. Chapters 10 and 11 of G. R. Davies
and D. Yoder, “Business Statistics,” 2d ed., John Wiley & Sons, Inc.,
New York, 1941, give excellent and simple treatments of the simpler curves,
including the Pearl-Reed population curve, which we have not covered.

EXERCISES 

1. The accompanying table shows the area A in square centimeters of a
wound at various times T in days, after the wound was first made. Find the
law which describes the rate of healing of the wound. The data are from
the Journal of Experimental Medicine, Feb. 1, 1918.

T    0    4    8   12   16   20   24   28   32   36   40   44
A   26   20   11    9    8    0    5    2   2½    2    1    ½


2. The rate at which the rootlets of pea plants grow seems to depend on
the temperature. The accompanying figures, taken from Annals of Botany,
1916, show the rate of growth in millimeters per hour G at various
temperatures T in degrees centigrade. Find the law which describes this
relationship.

T      4      8     12     16     20     24
G   0.12   0.29   0.47   0.74   1.01   1.52


3. The following table shows the length of time T which it takes for 
pendulums of various lengths L to swing through one vibration. Times are 
in seconds and lengths in inches. Find the “law of the pendulum.” 


L      10      20      30     40     50     60
T   0.506   0.715   0.876   1.01   1.13   1.24


4. The following table shows the prices P in cents received by farmers for
crops of various sizes or quantities Q in millions of bushels. Find the law
connecting the two variables.

200   182   167   154   143   133   125   118   111   105   100
100   110   120   130   140   150   160   170   180   190   200


5. G. K. Zipf (in “Human Behavior and the Principle of Least Effort,” 
p. 24, Addison-Wesley Press, Cambridge, Mass., 1949) gives data showing 
the frequency with which various words occur in James Joyce’s “Ulysses.” 
If we find the frequency of each word, and then rank the words in order of 
frequency, so that rank 1 is the word which occurs most frequently, rank 2 
that which occurs next most frequently, etc., we find an interesting rela¬ 
tionship between the rank ( R ) of a word and the frequency (F) with which 
it occurs as follows: 


R     10     20    30    40    50   100   200   300   400   500   1000
F   2653   1311   926   717   556   265   133    84    62    50     26


Find the law which connects these variables. How many times do you 
estimate the 5000th word would occur? (It actually occurred five times.) 

6. The International Athletic Federation on Oct. 2, 1950, recognized the
following world’s record times for running foot races of various distances.
Find the law which connects these two variables. Which records do you
think are most likely to be broken? Which least likely? What do you
think the record would be for a race of 1093.6 yd. if such a distance were
recognized? (Actually this distance is a kilometer, for which the world’s
record is 2 min., 21.4 sec.)

Distance     Record Time
(yards)      (minutes and seconds)

   100            9.3s
   220           20.3s
   440           46.0s
   880        1m 49.2s
 1,760        4m  1.4s
 3,520        8m 42.8s
 5,280       13m 32.4s
10,560       28m 30.8s
17,600       49m 22.2s


7. The accompanying table shows the yields per acre of potatoes in
West Virginia by years from 1910 to 1934. Yields are in bushels per acre.
Fit a second-degree parabolic trend to the data. Graph the original data
and the trend. What was the trend value in 1925? What was the actual
yield that year? Does such a discrepancy invalidate the trend? Why
would such discrepancies be apt to arise?

Year   Yield      Year   Yield

1910     91       1923    120
1911     45       1924    103
1912    111       1925     87
1913     82       1926    106
1914     55       1927    113
1915    117       1928    125
1916     87       1929    116
1917    115       1930     70
1918     87       1931     80
1919     90       1932     90
1920    119       1933     63
1921     91       1934     69
1922    100




8. The second-degree parabola fitted in the preceding problem might
work very well for purposes of interpolation, but it would almost certainly
work very poorly in extrapolation. Why?

9. From the formula for the trend in Exercise 7 above, would the trend
pass through a maximum or a minimum point? When would this maximum
or minimum point occur?

10. Plot a graph of the values of the equation

Y = 4 − 0.6X + 0.01X² + 0.008X³

with the values of X running from −10 to +10. Locate the curve at all
integral values of X and also at the points halfway between, that is, at
X = −10, −9.5, −9.0, −8.5, . . . , 9.0, 9.5, and 10. How many bends
appear in the curve? Does the maximum precede or follow the minimum?
Does this jibe with the rules given in the text?

11. Using the normal equations for the second- and third-degree parabolas 
as patterns, write the normal equations for a fourth-degree parabola. 
Students of the calculus should derive the equations and check the results. 

12. The accompanying table shows the volume V in cubic feet of a given 
quantity of a gas when various pressures P were applied in pounds per 
square inch. Find the law which describes the relationship between volume 
and pressure. (This is known as Boyle’s law.) 


P     10    14.7     20     26     34     45
V   1.15    0.78   0.57   0.44   0.34   0.25


13. When slices of fish brain cortex were tested in a microrespirometer it
was found that the brain tissue metabolism varied with the temperature.
The table below shows the oxygen consumption C in milliliters of oxygen per
milligrams of wet weight per hour, at various temperatures T in degrees
centigrade.¹ Find a mathematical formula for estimating oxygen
consumption at various temperatures. Explain what tests you applied in
deciding the type of curve to fit.

T   0.0   2.5   5.0   7.5   10.0   15.0   20.0   25.0
C    10    13    16    24     33     55     90    145


¹ Values are read roughly from a chart in R. Wennesland, A Volumetric
Microrespirometer for Studies of Tissue Metabolism, Science, Vol. 114,
No. 2952, July 27, 1951, p. 102.






CHAPTER XIII 


HISTORICAL DATA 

Any data which show how some variable changed with the 
passage of time are called historical data , or a historical series. 
In some cases the total period covered is very great, as when the 
scientist deals with rock formations or the glacial periods. In 
other cases the period covered may be but a small fraction of a 
second, as when the physicist measures the drop in electrical 
potential in a conductor after the switch has been turned off. 
Historical data need not record anything about important his¬ 
torical events, nor need they record something which happened a 
long time ago. Their sole identifying peculiarity is the fact that 
they involve passage of time. 

The whole concept of time is a troublesome one for the philoso¬ 
pher and for the scientist, and it is difficult to be sure just what we 
mean by the passage of time, or how we can measure it in any 
absolute sense. We shall not consider these matters here, how¬ 
ever, important and interesting though they are. We shall 
assume, as the layman ordinarily does, that we know what time is 
and how to measure it, and we shall confine our attention to 
studying some of the elementary problems which arise in all 
sciences with historical data. 

13.1. Types of Movements in Historical Data.—While there
are great differences in the data of different historical series, we 
can usually without much difficulty identify three major kinds of 
movements. In the first place, there are those cases where there 
is a general long-run tendency for the data to rise or fall in a long- 
continued pattern. Any long-time tendency for the data to 
increase or decrease is called a secular trend or a secular movement 
in the data. It is not necessary that the rise or fall continue each 
and every year throughout the period. If we have a quarter 
century during which prices tend generally to fall, we would say 
that there was a secular decline in prices during the period even 
though there might be an occasional isolated year in which prices 
rose somewhat. As long as we can say that the period as a whole
was characterized by an upward movement or by a downward 
movement, we say that a secular trend was present. When we 
say that a secular movement is “long-continued” we do not 
necessarily mean that it continued for several years. How long 
a period must be before we will call it a “long” one is relative. 
If we have put a strong germicide into a bacterial culture, and if 
we count each 10 sec. for 10 min. the number of organisms still 
alive, and if these 60 observations show a general pattern, we may 
well call it a secular movement; while if we are studying human 
death rates from tuberculosis with data stretching over 50 years, 
and we see a trend which continues for 3 months, we would not 
call it secular. A secular movement lasts for a long time in rela¬ 
tion to the data studied. 

In contrast with the long-continued secular movement, many 
series of historical data show rather rhythmical repetition. Thus 
daily average temperatures tend to rise from January until July or 
August and then fall again until February (in the temperate 
zones of the Northern Hemisphere), repeating the movement year 
after year. Likewise the temperature tends to run through a 
daily cycle, high just after noon and low just after midnight. 
Possibly it runs through still a third cycle with the coming and 
going of glacial epochs. Egg prices are low in the spring and 
high in the fall. The fever of a patient with tertian malaria 
runs through a 3-day cycle. A vibrating violin string may pass 
through a complete cycle 264 times each second. A cyclical 
movement is one which shows a characteristic repetition. If the 
cycle lasts 12 months, we call it a seasonal movement; if it lasts 
24 hours, we call it a diurnal cycle. 

Of course, secular and cyclical movements may both appear in 
the same historical series, as in Fig. 13.1, where we can see both 
the long-continued general rising trend and the shorter seasonal 
cycles. But it would be unusual in practice for the values all to 
lie exactly on a smooth sinuous curve like that of the figure. In 
practice we would be much more likely to find a situation like that 
depicted in Fig. 13.2, where the secular trend and the cyclical 
movement are still present, but where the points seem to be 
scattered more or less at random around the line. The secular 
trend and the cycle both follow definite patterns with the passage 
of time, but the random movements or erratic movements or residual 
movements show no time regularity whatever. They may, to be
sure, be definitely related to other things; but they show no rela¬ 
tionship to time. Figure 13.2 is a combination of the three major 
kinds of movements which appear in time series: secular, cyclical, 
and random. 



Fig. 13.1. Monthly gross sales of corporation X, 1935-1939, by year and
month. An idealized combination of seasonal and secular movements.

Fig. 13.2. Monthly gross sales of corporation X, 1935-1939. A hypothetical
case combining secular trend (the straight line), seasonal movement
(the regular wavy line), and random movements (shown by the departure
of the actual data, shown in circles, from the smooth curve).


A statistician who is studying corporation X, whose sales are 
plotted in Fig. 13.2, may be interested in the long-run or secular 
movement. In that case he may want to eradicate the cyclical 
and random movements and study the secular trend by itself. 
Similarly he may wish to isolate and study the cyclical movement;
or he may wish to eliminate both secular and cyclical movements 
and study the residual or random movements. But regardless of 
the one in which he is interested, he will have to study them all— 
some because he wants to eliminate or remove them, and some 
because he wants to describe and study them. 
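The way the three kinds of movement combine can be sketched with made-up figures. Everything in the sketch below is hypothetical: the numbers, the names, and in particular the assumption that the three movements simply add together, which is only one of the ways in which they might be combined.

```python
import math
import random

random.seed(0)  # fixed seed so that the made-up series is reproducible

months = list(range(60))  # five years of monthly figures

# Hypothetical secular trend: a steady rise of 1.5 units a month.
trend = [100 + 1.5 * t for t in months]

# Hypothetical seasonal movement: a wave repeating every 12 months.
seasonal = [20 * math.sin(2 * math.pi * t / 12) for t in months]

# Hypothetical random movement: no regular relation to time.
irregular = [random.gauss(0, 5) for _ in months]

# Assumed additive combination of the three movements.
sales = [tr + s + r for tr, s, r in zip(trend, seasonal, irregular)]

# Removing trend and seasonal movement leaves the random residual.
residual = [y - tr - s for y, tr, s in zip(sales, trend, seasonal)]

print(round(sales[0], 1), round(residual[0], 1))
```

Under this additive assumption, a statistician who can describe two of the movements can isolate the third by subtraction, which is why he must study them all.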

13.2. The Secular Trend. —In the two preceding chapters we 
have learned how to fit straight lines and curves to series of data. 
Where these data are historical, the resulting line or curve is a 
description of the secular trend. Thus we have already learned 
one of the most useful methods of discovering trends. Either 
freehand or least-square methods may be employed, and either 
straight lines or curves may be fitted. We may also find the 
trend value of data by the method known as the method of moving 
averages. This method is based on the assumption that minor 
variations in the Y variable are to be considered as unusual, and 
that they can be removed by the process of averaging. Suppose 
I wish to know what petroleum production was “normal” for
1920. Would it not be fair to tell me the average production for 
1920 and the two or three years before and after? This is 
exactly what constitutes the process of computing the moving 
average. 

Let us go back to our figures of petroleum production. The 
original dates and production figures are repeated as the first two 
columns of Table 13.1. In the third column of this table we have 
the moving average itself. Opposite each year is the average 
production of that year and of the two years preceding and the 
two years following; that is, each figure in the last column is an 
average of five years’ production centered at the given year. 
The moving average production for 1917 is 330.2. This is the 
average production for 1915, 1916, 1917, 1918, and 1919. Similarly
with each other year. In this case we have a five-year
moving average. We might, of course, use the average of three 
years or seven years or some other number of years. Obviously 
if our average includes the same number of years before as after 
the given year, the total number of years in the period will be 
odd. Thus here we include the year itself, two years before, and 
two years after, or five years altogether. Usually we use an odd 
number of periods for our moving average and center the average 
at the middle year, as in the example given here. 
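The computation of the five-year moving average can be sketched as follows, using the petroleum output figures of Table 13.1. The function name `centered_moving_average` is ours, and years for which no average can be computed are marked `None`, corresponding to the question marks in the table.

```python
years = list(range(1906, 1930))
# Output in millions of barrels, 1906 through 1929, from Table 13.1.
output = [126, 166, 179, 183, 210, 220, 223, 248, 266, 281, 301, 335,
          356, 378, 443, 472, 558, 732, 714, 764, 771, 901, 901, 1007]


def centered_moving_average(values, period):
    """Average of `period` successive values, centered at the middle one."""
    if period % 2 == 0:
        raise ValueError("this sketch handles only an odd number of periods")
    half = period // 2
    result = [None] * len(values)  # no average at either end of the series
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        result[i] = sum(window) / period
    return result


moving = dict(zip(years, centered_moving_average(output, 5)))

print(moving[1917])  # 330.2
print(moving[1906], moving[1929])  # None None
```

The figure for 1917 is the average of the output for 1915 through 1919, as described above.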

The moving average is most commonly applied to data which
are characterized by cyclical movements. It is employed to 
eliminate the cycles and leave the general trend of the data. We 
literally “average out” the seasonal or other cyclical variations. 
In such a case it is necessary to select a period for the moving 

Table 13.1.—Computation of Moving Average of Petroleum
Production

          Output         Moving-
Year      (millions      Average
          of barrels)    Output

1906        126            ?
1907        166            ?
1908        179          172.8
1909        183          191.6
1910        210          203.0
1911        220          216.8
1912        223          233.4
1913        248          247.6
1914        266          263.8
1915        281          286.2
1916        301          307.8
1917        335          330.2
1918        356          362.6
1919        378          396.8
1920        443          441.4
1921        472          516.6
1922        558          583.8
1923        732          648.0
1924        714          707.8
1925        764          776.4
1926        771          810.2
1927        901          868.8
1928        901            ?
1929       1007            ?

average which coincides with the length of the cycle; otherwise 
the cycle will not be entirely removed. When the period of the 
moving average and the period of the cycle in the data differ, 
the moving average will display a cycle which has the same period
as the cycle in the data, but which has less amplitude than the
cycle in the data. Often the statistician finds that the cycles in 
the data are not of uniform length. In such a case he usually 
takes a moving-average period equal to or somewhat greater than
the average period of the cycle in the data. 
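
The effect of matching or failing to match the period can be seen on artificial data. The sketch below is an illustration under assumed conditions, not data from the text: it builds a pure 7-period sine cycle around a level of 100, and the names `swing` and `CYCLE_PERIOD` are our own.

```python
import math


def centered_moving_average(values, window):
    """Centered moving average for an odd window; incomplete ends are dropped."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]


def swing(series):
    """Vertical distance from the lowest to the highest value of a series."""
    return max(series) - min(series)


CYCLE_PERIOD = 7  # assumed length of the artificial cycle
data = [100 + 10 * math.sin(2 * math.pi * t / CYCLE_PERIOD) for t in range(70)]

matched = centered_moving_average(data, 7)     # same period as the cycle
mismatched = centered_moving_average(data, 5)  # shorter than the cycle

print(round(swing(data), 2))        # swing of the original cycle
print(round(swing(matched), 2))     # practically zero: the cycle is removed
print(round(swing(mismatched), 2))  # reduced, but the cycle is still present
```

With the matched period the moving average is flat at the level of 100; with the mismatched period a cycle of smaller amplitude remains.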

We have noted that one usually selects an odd number of 
periods for his moving average, since the process of centering is 
then simplified. But if data have a marked cycle which extends 


Table 13.2.—Four-year Moving Average of Petroleum Production

            Output        Four-year      Moving
Year       (millions      moving         average
           of barrels)    average        centered

1906          126                           ?
1907          166                           ?
                           163.5
1908          179                         174.0
                           184.5
1909          183                         191.2
                           198.0
1910          210                         203.5
                           209.0
1911          220                         217.1
                           225.25
1912          223                         232.2
                           239.25
1913          248                         246.9
                           254.5
1914          266                         264.2
                           274.0
1915          281                         284.9
                           295.75
1916          301                         307.0
                           318.25
1917          335                         330.4
                           342.5
1918          356                         360.2
                           378.0
1919          378                         395.2
                           412.25
1920          443                         437.5
                           462.75
1921          472                         507.0
                           551.25
1922          558                         585.1
                           619.0
1923          732                         655.5
                           692.0
1924          714                         718.6
                           745.25
1925          764                         766.4
                           787.5
1926          771                         810.9
                           834.25
1927          901                         864.7
                           895.0
1928          901                           ?
1929         1007                           ?

over an even number of periods (as, for example, a 12-month 
cycle, which is very common) it is necessary to take an even 
number of periods for the moving average. If we take, for exam¬ 
ple, a four-year moving average of the data in Table 13.1, showing 
the results in Table 13.2, we discover that the moving averages 
appear between the years rather than at the years. The first 
figure in the third column, 163.5, is the average petroleum production
for the years 1906, 1907, 1908, and 1909. The center of this
series of years is halfway between 1907 and 1908, and we there¬ 
fore enter our moving average halfway between them. But 
ultimately we want the value for each year, and not the value at 
points between. Therefore when our moving average is based 
on an even number of periods we add a fourth column, which is 
a centered two-period moving average of the third column. 

The first figure in the last column of Table 13.2 is 174.0, the 
average of the first two figures in the third column. It is entered 
halfway between the first two figures of the first column, which 
sets it opposite the year 1908. The other figures in the last 
column are found similarly. 
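
The two-stage centering just described may be sketched in Python as follows. This is an illustrative sketch only; the helper name `moving_average` is our own, and it returns the averages without attaching them to years, so the alignment is handled separately.

```python
def moving_average(values, window):
    """Plain moving average of every run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Petroleum output, millions of barrels, 1906-1929 (Table 13.1).
output = [126, 166, 179, 183, 210, 220, 223, 248, 266, 281, 301, 335,
          356, 378, 443, 472, 558, 732, 714, 764, 771, 901, 901, 1007]

# Stage 1: four-year averages, which fall halfway between two years.
four_year = moving_average(output, 4)

# Stage 2: a two-period average of stage 1 brings each value back onto a year.
centered = moving_average(four_year, 2)

# The first centered value belongs to the third year of the series.
first_centered_year = 1906 + 2
centered_by_year = {first_centered_year + k: v for k, v in enumerate(centered)}

print(four_year[0])            # 163.5
print(centered_by_year[1908])  # 174.0
```

The same two stages, with 12 in place of 4, center a 12-month moving average on a month.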

13.3. The Progressive Mean.— When we are told that the 
five-year moving average of petroleum production was 648
million barrels in 1923 (figures from Table 13.1), we understand 
that this is not necessarily the actual output (which was 732 
million barrels), but it is a “normal” output for the five-year 
period centered at 1923. It is the simple arithmetic average 
of the outputs of the five-year period centered at 1923. It has 
been suggested by some statisticians that in a case of this kind 
the center year should be given more weight in computing the 
average than the other years, and the farther we go from the 
center of the period being averaged the less weight we should 
give. It would be a simple matter, of course, to compute a 
weighted moving average, always weighting the center year 10, 
the year each side of the center 6, and the second year from the 
center in either direction 1, or any other such set of weights, 
diminishing as we draw farther from the center of the period. 
When such weighting is used, it has been common to weight 
the years with the coefficients of the binomial expansion which 
has the requisite number of terms. If we turn to Table 7.1, 
page 163, we discover that when the binomial has five terms the 
coefficients are 1, 4, 6, 4, and 1. If we were to find the weighted 
arithmetic average of the first five years of Table 13.1, using 
these weights, we should perform the computations shown in 
Table 13.3. 

Using now our formula for the weighted arithmetic mean (see 
page 63), we have 


$$\bar{X} = \frac{\Sigma(XW)}{\Sigma W} = \frac{2806}{16} = 175.4$$

This average would be set opposite the central year, 1908.
Similarly, we would compute a weighted arithmetic mean for 
each other set of five consecutive years, using each time these 
same weights, and always setting the weighted mean opposite 
the central year of the period. When the weights used in com¬ 
puting the weighted moving average are the coefficients of the 
expanded binomial, as in the example just given, the result is 
known as a progressive mean; that is, a progressive mean is a 
weighted moving average with the binomial coefficients as 
weights. It is obvious that the work entailed in computing a 

Table 13.3.—Computation of the Progressive Mean

Year        Output     Weight
             (X)        (W)        (XW)

1906         126         1          126
1907         166         4          664
1908         179         6         1074
1909         183         4          732
1910         210         1          210

Totals                  16         2806

progressive mean is far greater than that required for the com¬ 
putation of the ordinary unweighted moving average, and as a 
result the latter is far more common in practice. 
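
Although laborious by hand, the progressive mean is easily computed by machine. The Python sketch below is an illustration only; the function name `progressive_mean` is our own, and the binomial coefficients are taken from the standard library function `math.comb` rather than from Table 7.1.

```python
from math import comb


def progressive_mean(values, window):
    """Weighted moving average with binomial coefficients as weights.

    `window` must be odd; positions without a full window are None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    weights = [comb(window - 1, k) for k in range(window)]  # 1, 4, 6, 4, 1 for 5
    total_weight = sum(weights)
    half = window // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        segment = values[i - half:i + half + 1]
        result[i] = sum(x * w for x, w in zip(segment, weights)) / total_weight
    return result


# First five years of Table 13.1 (1906-1910), as in Table 13.3.
first_five = [126, 166, 179, 183, 210]
progressive = progressive_mean(first_five, 5)
print(progressive[2])  # 175.375, entered opposite 1908 and rounded to 175.4
```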

13.4. Advantages and Disadvantages of the Moving Average.— 
The moving average is a simple concept, easily understood, and 
gives a good picture of the general long-run movement in data 
which contain rather uniform cycles if the long-run trend of the 
data, if any, is roughly linear. With a long-run curvilinear trend 
the moving average will tend to give values which are too low if 
the trend is concave downward, and too high if it is concave
upward, and in such cases it is better to use one of the mathe¬ 
matical curves described in the preceding chapter. Of course, 
if the trend appears to be semilogarithmic we could fit a moving 
geometric average instead of a moving arithmetic average. This 
would be accomplished by finding the ordinary moving average 
of the logarithms of the values rather than of the values them¬ 
selves. In such cases, however, it would probably be better to 
fit a semilogarithmic curve. The moving average is not fixed 
into any set pattern, as are the various mathematical curves, 
and has the advantage that it follows the general movements
of the data. Its shape is determined by the data rather than 
by the statistician’s choice of a mathematical function. 

Yet the moving average has the disadvantage that it cannot be 
carried to the extremes of the period studied. In Table 13.1, for 
example, we cannot find the values of the moving average for the 
years 1906, 1907, 1928, and 1929. It is often these extreme years
in which we are most interested; yet our method precludes us 
from finding the “normal” value in them. Extrapolation is 
obviously also impossible. 

13.5. The Nature of Cyclical Movements.—Cyclical movements
differ from secular movements in that the former go
through a given routine and then repeat it over and over again. 
Repetition is the essence of cyclical movement. 

Yet in most actual cases the repetitions are not exact. In 
Table 13.4 we list the monthly mean temperatures in New York 
City for a period of five years. It is immediately evident that 
there are seasonal regularities, with high temperatures in July 


Table 13.4. Monthly Mean Temperatures, New York City,
1935-1939 1

Month          1935     1936     1937     1938     1939

January        29.2     29.9     40.4     32.0     32.3
February       31.6     26.6     34.9     35.6     37.4
March          43.2     45.3     36.6     44.2     38.8
April          49.5     47.2     49.0     53.4     47.8
May            58.8     62.6     63.3     59.4     63.7
June           68.6     68.6     70.6     69.0     70.8
July           76.2     74.8     75.4     75.1     74.1
August         73.6     74.1     75.7     76.3     76.8
September      64.2     67.1     65.2     64.9     67.4
October        56.8     57.0     54.6     58.6     56.4
November       48.6     42.4     45.6     47.7     43.2
December       30.6     39.2     35.4     37.2     36.2

1 Data from “World Almanac,” p. 187, 1941.


and August and low temperatures in January. Yet inspection 
of the data will show that exact repetition even in a single month 
hardly ever occurs. The only case in the five years where any 
month had the same monthly mean temperature twice is that of 
June, 1935 and 1936. It is approximate repetition, and not
exact repetition, that we look for in cyclical data. 

From the name we are apt to think of a cycle as a rounded, 
wavelike movement, similar, perhaps, to that shown in Fig. 13.1. 
Sometimes, especially in the “exact” sciences, we come across
data which exhibit such symmetrical regularity. Figure 13.3 
shows two complete cycles of this character. 1 When we wish to 
describe the cycle, we may wish to tell how long a typical cycle 
lasts. This would be the time elapsing between any point on 
one cycle and the corresponding point on the next cycle, but 
since it is hard at most parts of the cycle to say which points 



Fig. 13.3. Two cycles of a sine curve. 

“correspond,” it is common to measure from one peak to another 
or from one trough to another. The distance along our base 
scale marked with arrows in Fig. 13.3 measures the time from 
one peak to another. The time that is required for one complete 
cycle, measured as the horizontal distance from any point in one 
cycle to the corresponding point in the next cycle, is called the 
period of the cycle. Since it is a measurement on the horizontal 
base scale, the period of a cycle is always a length of time, such
as a year or a month or 5 min. 

Cycles differ not only in period, but also in the extent of the 
‘‘up-and-down” movement—the height of the peaks and the 
depth of the troughs. This vertical distance, represented by 
the length of the broken vertical lines in Fig. 13.3, is called the 
1 The curve in the figure is a sine curve. 
amplitude of the cycle. The amplitude is measured at right
angles to the base scale, vertically, and therefore it is always in
units of the non-time variable that is being measured. For
example, if we were referring to the temperature cycles of Table 
13.4, the period of the cycle would be a length of time (12 months 
in this case), and the amplitude of the cycle would be a number 
of degrees of temperature (about 43.5° in this case if we take 
the cycle very roughly as running from a low of 32° in January 
to a high of 75.5° in August). The usefulness of these two 
figures—period and amplitude—is evident at once. For exam¬ 
ple, weather data show that the amplitudes of annual tempera¬ 
ture fluctuations in various United States cities are roughly as 
follows: 


Boston ..........  44°
Charleston ......  31°
Chicago .........  4[?]°
Miami ...........  15°
Los Angeles .....  10°
Bismarck ........  62°

In each of these cases the period of the cycle is 12 months. 
But the difference in amplitude between Bismarck, North 
Dakota, and Miami, Florida, is startling. 
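
The period and amplitude of the New York temperature cycle can be read roughly from Table 13.4 by averaging each month over the five years. The Python sketch below is an illustration only; the figures are those of Table 13.4 as read from the printed table, and the variable names are our own. Because it uses the five-year means rather than the round figures quoted in the text, its amplitude differs slightly from the 43.5° given above.

```python
# Monthly mean temperatures, New York City, 1935-1939 (Table 13.4).
temps = {
    "January":   [29.2, 29.9, 40.4, 32.0, 32.3],
    "February":  [31.6, 26.6, 34.9, 35.6, 37.4],
    "March":     [43.2, 45.3, 36.6, 44.2, 38.8],
    "April":     [49.5, 47.2, 49.0, 53.4, 47.8],
    "May":       [58.8, 62.6, 63.3, 59.4, 63.7],
    "June":      [68.6, 68.6, 70.6, 69.0, 70.8],
    "July":      [76.2, 74.8, 75.4, 75.1, 74.1],
    "August":    [73.6, 74.1, 75.7, 76.3, 76.8],
    "September": [64.2, 67.1, 65.2, 64.9, 67.4],
    "October":   [56.8, 57.0, 54.6, 58.6, 56.4],
    "November":  [48.6, 42.4, 45.6, 47.7, 43.2],
    "December":  [30.6, 39.2, 35.4, 37.2, 36.2],
}

# Five-year mean for each month.
means = {month: sum(vals) / len(vals) for month, vals in temps.items()}

low_month = min(means, key=means.get)
high_month = max(means, key=means.get)

period_months = len(temps)                          # a length of time
amplitude = means[high_month] - means[low_month]    # degrees of temperature

print(period_months, low_month, high_month, round(amplitude, 1))
```

The period comes out in units of time and the amplitude in degrees, which is the distinction drawn in the text.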

It is perhaps easiest to illustrate the ideas of period and ampli¬ 
tude with the sinuous, regular, rounded cycles of Fig. 13.3, but 
the student must not get the idea that all cycles are of this 
character. It is repetition at approximately equal time intervals
and not smooth, flowing regularity which makes a cycle. 1 For 
example, the sales in a chain grocery store may run along at an 
approximate level from Monday to Friday, increase greatly on 
Saturday, and disappear entirely on Sunday. This would be 
a weekly cycle, even though when plotted it showed none of the 
wave motion of Fig. 13.3, but looked more like Fig. 13.4. Simi¬ 
larly the annual cycle of sales by a department store might show 
sudden very sharp increases just before Christmas and Easter, 
with sales disappearing entirely on Sundays and holidays. 

1 The word “cycle” and the word “circle” come from the same root, and
perhaps it would have been better to have confined the idea of cycles to 
those circular functions which do exhibit the roundness and regularity 
which the student has come across in his study of trigonometry or the 
calculus. Usage in the field of statistics, however, justifies the definition 
above. 

13.6. Common Periods of Cycles.—Cycles of any period can
occur, but in practice certain lengths of period are much more 
common than others for reasons which it is easy to understand. 
Perhaps the commonest cycle is the annual or seasonal cycle, 
which is astronomical in origin, but which shows up in the data 
of almost every science. We have already noticed the annual 
fluctuations in temperature, but a moment’s thought will suggest 
similar annual cycles in the growth of vegetation, 1 the rates of 
metabolism among animals, school attendance, volumes of 
traffic, production of farm products, birth and death rates, etc. 



Fig. 13.4. Daily sales in H. D. Newton’s grocery store for the month of
March, 1942. 


The annual cycle is of importance not only in astronomy, but 
also in all the biological and the social sciences, where the 
seasonal changes have important secondary effects. The annual 
cycle is of less importance, perhaps, in the physical sciences. 
When a cycle lasts 12 months we call it a seasonal movement, so
we can say that a seasonal movement is one particular kind of 
cyclical movement—the one with a 12-month period. 

In many sciences there are also important daily or diurnal 

1 It is this annual cycle, of course, which produces the “rings” in trees by 
means of which we ascertain their age. Recent studies of the tree-ring 
cycles have made it possible to find the dates at which timbers were cut for 
hundreds of years in the past, and to learn something of weather cycles at 
times long before weather records were kept. 
cycles. Some of the phenomena which we have just mentioned
as exhibiting seasonal movements also exhibit diurnal cycles. 
For example, the temperature of the air, the volume of traffic, 
birth and death rates, etc., vary from one hour of the day to 
another. In the field of medicine it is well known that various 
diseases show typical diurnal cycles in the patient's temperature. 
A student's mental alertness and his rate of learning vary diurnally.
Marketing studies show that almost every kind of retail
store has its busy hours of the day and its slack hours, varying 
for different kinds of stores, to be sure, but exhibiting typical 
diurnal cycles for any particular type. 

Although the week seems to be a far more arbitrary time unit 
than the day or the year, having far less basis in natural phenom¬ 
ena, nevertheless the week has become firmly enough fixed as 
part of our lives so that weekly cycles are not uncommon, in the 
social sciences particularly. There are marked weekly cycles 
in the sales and prices of many perishable farm products. In 
heavily settled parts of the country there are decided weekly 
cycles in highway traffic. Certain days of the week are days of 
large sales in department stores, and other days are days of 
little business. Studies by the personnel departments of large 
corporations show what is at first a surprising weekly cycle in 
the number of employees absent for sickness, some days of the 
week being chosen for such absences far more commonly than 
others. (Strange to say, it is just before, and not just after, the 
week end that these absences are most common.) Even death 
rates show a weekly cycle, especially in summer when automobile 
accidents and drownings make their mark. The week-end
holiday has become an integral part of our lives, and its effects 
show up as a weekly cycle whenever men's habits come into play. 

These three periods—12 months, 1 week, and 1 day—are by 
far the most common periods when we look at the problem of 
cycles in general. In particular cases we may have cycles that 
last far longer, such as the sunspot cycle of just over 11 years; 
or cases which fall in between, such as the typical 3-day cycle 
of tertian malaria. A cycle is just as important, of course, 
whether it coincides with one of the three common periods or 
not. The student who is testing data for cyclical movements 
will do well, however, to look for these periods first. And, of 
course, it is evident from the examples just given that several 
periods of cycles may be mixed together in the same data, as
in the case of highway traffic outside a large city at 5 o’clock 
in the afternoon of a Sunday in the late summer, when we have 
the coincidence of diurnal, weekly, and annual peaks. 

13.7. Seasonal Variation Measured around the Moving Average.—We
shall start our discussion of cyclical movements by
the analysis of a case of seasonal variation, since that is perhaps 
the commonest of all lengths of cycle. Just as is true with the 



York City market, 1919-1923. (Data from Table 10.4.)

secular trend, we may be interested in the nature of the cyclical 
movement itself, or we may wish to measure it so that we can 
eliminate it and study those movements which remain. We shall 
start by describing a seasonal movement, and then we shall 
eliminate it. 

Table 13.5 shows monthly egg prices in New York City from 
1919 through 1923. These data are shown graphically in Fig. 
13.5. The most noticeable feature of this chart is the fact that 
there are decidedly regular periodic swings in the data. The 
seasonal movement dominates the whole chart. The move¬
ments are not uniform from year to year, to be sure; there are 

Table 13.5.— Prices of Near-by-hennery White Eggs, New York City,
by Months, 1919-1923

Month          1919     1920     1921     1922     1923

January          72       84       76       56       57
February         56       70       52       49       47
March            48       59       43       38       44
April            52       54       38       38       39
May              53       53       33       38       40
June             56       56       39       43       41
July             63       65       50       45       45
August           68       71       57       55       53
September        75       82       71       66       62
October          88      100       86       82       77
November         98      102       95       89       83
December         82       95       78       70       64

Table 13.6.— Average Monthly Price of Near-by-hennery White
Eggs, New York City, 1919-1923

                  Price
Month           (cents per
                  dozen)

January            69.0
February           54.8
March              46.4
April              44.2
May                43.4
June               47.0
July               53.6
August             60.8
September          71.2
October            86.6
November           93.4
December           77.8

variations in the amplitude of the waves. But the similarities 
of the successive yearly movements are much more striking than 
are the differences. If one marks the crest of each wave or the 
trough of each wave, he finds that there are 12 months from crest 
to crest or from trough to trough. This regular 12-month period 
is characteristic of the movements which we call “seasonal”
movements. Other cyclical movements are characterized by 





















HISTORICAL DATA 


401 


periods of other lengths. The peculiarities of the present move¬ 
ments become even more evident if we plot the prices of the 
various years one above the other, shifting the vertical scale of 
the diagram upward each year so that the years will lie in order. 
The chart shown in Fig. 13.6 has been constructed in this way.
The appearance of each year above the preceding year is not due 
to the fact that prices were higher, but to the fact that the 
vertical scale has been shifted. When we look at this chart, we 




Fig. 13.6. Monthly egg prices in New York City, 1919-1923. The data 
for the various years are all on the same vertical scale, but each year the 
scale has been raised enough so that the cycles will not overlap. No two 
years are on the same base line. 

note again the striking similarity between the movements of the 
various years. 

If one were sure that there were no secular trend or long-time 
cycle in the data, one could describe the seasonal movement 
easily by computing the average January egg price, the average 
February egg price, etc. In this way he would get an average 
price for each month, as in Table 13.6. These figures make the 
seasonal movement very clear. They point out the times of 
high and the times of low prices. If we wished, we could convert 
these figures into an index of seasonal variation by computing 
the average of the 12 monthly averages (which turns out to be
62.35) and stating each of the 12 monthly average prices as a 
percentage of their average. This computation would give us 
Table 13.7. This table tells us that the January prices were, on 
the average, 11 per cent above the yearly average price; February 
prices were 12 per cent below the average of the year; etc. 
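
The simple index just described may be sketched in Python as follows. This is an illustration only; the variable names are our own, and the monthly prices are those of Table 13.5 in the form that agrees with the averages of Table 13.6. Results carried to more decimals may differ in the last digit from the rounded entries of the printed table.

```python
# Egg prices, cents per dozen, January through December (Table 13.5).
prices = {
    1919: [72, 56, 48, 52, 53, 56, 63, 68, 75, 88, 98, 82],
    1920: [84, 70, 59, 54, 53, 56, 65, 71, 82, 100, 102, 95],
    1921: [76, 52, 43, 38, 33, 39, 50, 57, 71, 86, 95, 78],
    1922: [56, 49, 38, 38, 38, 43, 45, 55, 66, 82, 89, 70],
    1923: [57, 47, 44, 39, 40, 41, 45, 53, 62, 77, 83, 64],
}

n_years = len(prices)

# Average price for each calendar month over the five years (Table 13.6).
monthly_average = [sum(year[m] for year in prices.values()) / n_years
                   for m in range(12)]

# Average of the 12 monthly averages.
grand_average = sum(monthly_average) / 12

# Each monthly average as a percentage of the grand average (Table 13.7).
simple_index = [100 * avg / grand_average for avg in monthly_average]

print(round(grand_average, 2))
print([round(x, 1) for x in simple_index])
```

The 12 index numbers necessarily average 100, so that a month above 100 is above the yearly average and a month below 100 is below it.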

This method of describing the seasonal variation would not 
give the correct results, however, if the data included either a 


Table 13.7.— Index of Seasonal Variation in Egg Prices

                Seasonal
Month            index

January          111.0
February          88.0
March             74.4
April             71.0
May               69.6
June              75.4
July              86.0
August            97.6
September        114.3
October          139.0
November         149.8
December         125.0


secular trend or a cycle other than the 12-month cycle. Hence 
the statistician commonly uses a method which is slightly more 
complex. It consists in finding the 12-month moving average 
of the data first, and in removing this moving average. Since 
the moving average is a 12-month one, it will have in it nothing 
of the seasonal movement; the monthly variations will be entirely 
ironed out. But the trend and cycles other than 12-month 
cycles will remain in the moving average. When, therefore, we 
remove the moving average from the original data, we shall be 
removing the trend and other cycles but not the seasonal varia¬ 
tion which we wish to study. 1 

1 If we were studying some cycle other than a seasonal one, we should, of 
course, use a different moving average. If, for example, a study of the 
graphed data led us to believe that there was a 14-year cycle, we should 
compute the 14-year moving average, etc. 

Let us try this plan with our data on egg prices. We first
tabulate the original data, and we then compute the 12-month 
moving average. We have discovered that a moving average 
is placed at the center of the period, and this would make us place 
the moving average halfway between June and July in the first 
year. We could adjust our moving average so that it would 
properly be placed at July instead of halfway between June and 
July, 1 but since such an adjustment ordinarily makes no sig¬ 
nificant difference in the final results, we shall omit it here and 
center each year’s moving average at the seventh month. Also, 
in order to save time, we may omit the division of each 12-month 
total by 12; that is, we may use the moving total rather than the 
moving average. This makes no difference whatever in the 
result, for a reason that will soon be evident. 

Table 13.8 gives for each month the price of eggs (as shown in 
Table 13.5, page 400), the moving total of prices centered at the 
seventh month, and the percentage which the former is of the 
latter. Each figure in column 2 is the price of eggs for the indi¬ 
cated month. Each figure in column 3 is the sum of the price 
in the indicated month and the prices in the six months before 
and the five months after; it is the moving total of 12 months’ 
prices centered at the seventh month. Each figure in column 4 
is the percentage which the corresponding figure in column 2 is 
of the corresponding figure in column 3. If we take as an exam¬ 
ple the month of July, 1919, we find that the price of eggs was 
63 cents, the sum of the July price and the prices of the six 
months before and the five months after is 811 cents, and 63 cents 
is 7.76 per cent of 811 cents. 

We have seen that one of the difficulties of the moving average 
is that it cannot be extended to the extremes of the data (see 
page 394). In this case it means that we have no moving total 
for the first six or the last five months; the last two columns are 
therefore vacant for these months. The moving average has the 
advantage, however, that it is flexible; and while it does in this 
case eliminate any and all regular 12-month movements, it does 

1 Since the first figure for the moving average would represent a point 
halfway between June and July, and the second a point halfway between 
July and August, the average of these first two figures would represent 
July. Thus we could take the 12-month moving average and then take a 
2-month moving average of the result, centering on the seventh month. 

Table 13.8.—Elimination of Moving Total from Egg Prices

                     Price      Moving     Per cent
Month               (cents)     total     [(2)/(3)]
 (1)                  (2)        (3)         (4)

1919
  January              72
  February             56
  March                48
  April                52
  May                  53
  June                 56
  July                 63        811         7.76
  August               68        823         8.26
  September            75        837         8.97
  October              88        848        10.37
  November             98        850        11.51
  December             82        850         9.65
1920
  January              84        850         9.88
  February             70        852         8.21
  March                59        855         6.90
  April                54        862         6.26
  May                  53        874         6.07
  June                 56        878         6.38
  July                 65        891         7.29
  August               71        883         8.05
  September            82        865         9.47
  October             100        849        11.78
  November            102        833        12.27
  December             95        813        11.69
1921
  January              76        796         9.55
  February             52        781         6.65
  March                43        767         5.61
  April                38        756         5.02
  May                  33        742         4.45
  June                 39        735         5.30
  July                 50        718         6.96
  August               57        698         8.17
  September            71        695        10.20
  October              86        690        12.45
  November             95        690        13.77
  December             78        695        11.21
1922
  January              56        699         8.01
  February             49        694         7.08
  March                38        692         5.50
  April                38        687         5.54
  May                  38        683         5.57
  June                 43        677         6.35
  July                 45        669         6.75
  August               55        670         8.20
  September            66        668         9.89
  October              82        674        12.18
  November             89        675        13.19
  December             70        677        10.31
1923
  January              57        675         8.45
  February             47        675         6.97
  March                44        673         6.55
  April                39        669         5.84
  May                  40        664         6.04
  June                 41        658         6.23
  July                 45        652         6.90
  August               53
  September            62
  October              77
  November             83
  December             64

not eliminate the trend or cyclical movements. These are still
contained in the moving average, and when we remove the mov¬ 
ing average they are removed with it. 

Had we computed the moving average rather than the moving 
total, each figure in column 3 would have been divided by 12. 
Since each figure in column 3 would then be 1/12 as large, each
figure in column 4 would be 12 times as large. But the relative 
sizes of the figures in this column would be unchanged. We are 
interested in the relative sizes of these figures and not in their 
absolute size. For this reason we save time by omitting the 
division by 12 in column 3—that is, by using the moving total 
rather than the moving average. 


Table 13.9.— Arrays of Monthly Deviations

Month                             Results

January                  8.01     8.45     9.55     9.88
February                 6.65     6.97     7.08     8.21
March                    5.50     5.61     6.55     6.90
April                    5.02     5.54     5.84     6.26
May                      4.45     5.57     6.04     6.07
June                     5.30     6.23     6.35     6.38
July            6.75     6.90     6.96     7.29     7.76
August                   8.05     8.17     8.20     8.26
September                8.97     9.47     9.89    10.20
October                 10.37    11.78    12.18    12.45
November                11.51    12.27    13.19    13.77
December                 9.65    10.31    11.21    11.69

Note the meaning of the figures in our last column. Each is 
stated as a percentage of the moving total. If a figure is large, 
the price in that month was high as compared with the moving 
total. Thus we have eliminated the moving total (this corresponds,
as we have seen, to removing the moving average) from
our data. The figures in the last column retain whatever sea¬ 
sonal and random movements were present in the original data, 
but do not include the secular and long-time cyclical movements. 

Let us now gather together the results for each month for pur¬ 
poses of comparison. For each month other than July we shall 
have four figures, and for July we shall have five. In assembling 
them let us arrange the results for each month in order of size—in 
an array. The results are given in Table 13.9. 

We could, of course, take the mean of these figures for each
month, but with so few figures for each month this would be 
likely to give undue weight to extreme items. It is, therefore, 
more common to take the median. The median will be halfway 
between the second and third figure (save in the case of July, 
when it will be the third figure). The medians, then, will be 
these: 


January ......   9.00        July .........   6.96
February .....   7.02        August .......   8.18
March ........   6.08        September ....   9.68
April ........   5.69        October ......  11.98
May ..........   5.80        November .....  12.73
June .........   6.29        December .....  10.76


These figures show the median per cent which the price for each 
month was of the moving total. We can easily make an index 
of seasonal variation from these figures by computing the average 
for the year and finding what per cent the figure for each month 
is of this average. The average of the 12 figures just given is 8.35.
If we state the figure for each month as a percentage of 8.35,
we get the index of seasonal variation shown in Table 13.10.


Table 13.10.— Index of Seasonal Variation in Egg Prices

Month            Index

January          107.8
February          84.0
March             72.9
April             68.1
May               69.5
June              75.3
July              83.4
August            98.0
September        115.9
October          143.5
November         152.4
December         128.7


If this index of seasonal variation is compared with that which 
was computed by the simpler method on page 402 it will be seen 
that, although the general nature of the seasonal movement is 
the same by either method, there are, nevertheless, some fairly 
sizable differences in results. The method just outlined is, for
the reasons already noted, the method to be preferred. 

These figures give us a good picture of the seasonal movement 
itself. They tell us that egg prices tend to be at their peak in 
November and “reach bottom” in April. They tell us, more¬ 
over, that at the peak the prices tend to be over 50 per cent above 
the season's average, and that in the trough they fall to a point 
more than 30 per cent below the season's average. We can find 
the times of high prices and the times of low prices, and we can



Fig. 13.7. Seasonal movement of egg prices. (Data from Table 11.6.)

also get some idea of the amplitude of the movement. The 
highest prices tend, on the average, to be more than double the 
lowest prices (see Fig. 13.7). 

To summarize the steps necessary for finding an index of 
seasonal variation based on the moving average, we have the 
following: 

1. Tabulate the original data. 

2. Compute a 12-month moving total centered at the seventh month. 

3. Divide each original entry by the corresponding moving total. State 
the result as a percentage. 

4. Sort out these percentages by months, and find the median percentage 
for each month. 

5. Express each of these monthly medians as a percentage of the average
of all 12 monthly medians. This is the index of seasonal variation.
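
The five steps above can be sketched in Python. This is only an illustration under stated assumptions, not a transcription of the worked example: the function name `seasonal_index_moving_total`, the 0-based month numbering, and the assumption that the series begins with a January are all introduced here.

```python
from statistics import mean, median

def seasonal_index_moving_total(values, period=12):
    """Index of seasonal variation from a 12-month moving total.

    `values` holds monthly figures in time order; the first entry is
    assumed (for this sketch) to be a January. At least two full
    years are needed so that every calendar month receives a ratio.
    """
    ratios = {m: [] for m in range(period)}
    for start in range(len(values) - period + 1):
        total = sum(values[start:start + period])   # step 2: moving total
        centre = start + period // 2                # seventh month of the window
        pct = 100.0 * values[centre] / total        # step 3: per cent of the total
        ratios[centre % period].append(pct)         # step 4: sort by calendar month
    medians = [median(ratios[m]) for m in range(period)]
    average = mean(medians)
    # step 5: each monthly median as a percentage of the average median
    return [100.0 * x / average for x in medians]
```

For a series that repeats exactly from year to year, the index for each month reduces to that month's value as a percentage of the yearly average, which is a convenient check on the arithmetic.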

13.8. Seasonal Variation by Link Relatives.—Another common
method of measuring seasonal variation is by means of what are
called link relatives. Whenever we have a set of figures for a
number of months (or other time periods), we find the link
relative for any month by dividing the figure for that month
by the figure for the preceding month and multiplying the
quotient by 100. In other words, a link relative for any month
is the percentage which the value that month is of the value in
the preceding month. We shall illustrate with the same egg
prices with which we have just been working. The original
figures appear in Table 13.8. They have been converted to link
relatives in Table 13.11. Since a link relative for any month


Table 13.11.—Link Relatives of Egg Prices in New York City, 1919-1923

Month          1919    1920    1921    1922    1923
January               102.4    80.0    71.8    81.4
February       77.8    83.3    68.4    87.5    82.5
March          85.7    84.3    82.7    77.6    93.6
April         108.3    91.5    88.4   100.0    88.6
May           101.9    98.1    86.8   100.0   102.6
June          105.7   105.7   118.2   113.2   102.5
July          112.5   116.1   128.2   104.7   109.8
August        107.9   109.2   114.0   122.2   117.8
September     110.3   115.5   124.6   120.0   117.0
October       117.3   122.0   121.1   124.2   124.2
November      111.4   102.0   110.5   108.5   107.8
December       83.7    93.1    82.1    78.7    77.1


involves comparison with the preceding month, we do not know
the link relative for our first month, January, 1919. Each of
the other link relatives is found by the method just described.
For example, if we want the link relative for September, 1923,
we note (see Table 13.8) that the price of eggs in September,
1923, was 62 cents, while in the preceding month the price was
53 cents. Our computation is, then

$$\frac{62(100)}{53} = 117.0$$

This answer shows that the price of eggs in September, 1923, was
17 per cent above the price of the preceding month. Whenever
prices are rising, the link relative will exceed 100. When the
price is falling, the link relative will fall short of 100.
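
The computation just described can be written as a short Python sketch. The function name `link_relatives` and the use of `None` for the first period are choices made here for illustration; only the two prices quoted in the text (53 and 62 cents) are used as data.

```python
def link_relatives(values):
    """Return each value as a percentage of the value before it.

    The first period has no predecessor, so its link relative is None.
    """
    relatives = [None]
    for previous, current in zip(values, values[1:]):
        relatives.append(100.0 * current / previous)
    return relatives

# August and September, 1923: 53 cents, then 62 cents.
print(round(link_relatives([53, 62])[1], 1))   # 117.0
```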

Inspection of Table 13.11 will show that there are some months 
(such as September) in which the price is almost always higher 
than it was in the preceding month, while there are other months 
(such as February) when the price is almost always below that 
of the preceding month. If we want some idea of the typical 
situation in any month, it is natural for us to take some sort of 
average of the link relatives for that month. We might, of 
course, take the arithmetic mean of the link relatives, but, as 
we pointed out in the preceding section, when there are so few 
figures one or two erratic values will throw the arithmetic mean 
very far off. For that reason we take the median link relative 
for each month. These medians are 


January        80.7    July          112.5
February       82.5    August        114.0
March          84.3    September     117.0
April          91.5    October       122.0
May           100.0    November      108.5
June          105.7    December       82.1
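
As a check on these medians, the sketch below recomputes them from the rows of Table 13.11. The dictionary name `LINK_RELATIVES` and the function `monthly_medians` are introduced here for illustration only; the January row has four entries because January, 1919, has no link relative.

```python
from statistics import median

# Link relatives from Table 13.11, one row per month, 1919-1923.
LINK_RELATIVES = {
    "January":   [102.4, 80.0, 71.8, 81.4],
    "February":  [77.8, 83.3, 68.4, 87.5, 82.5],
    "March":     [85.7, 84.3, 82.7, 77.6, 93.6],
    "April":     [108.3, 91.5, 88.4, 100.0, 88.6],
    "May":       [101.9, 98.1, 86.8, 100.0, 102.6],
    "June":      [105.7, 105.7, 118.2, 113.2, 102.5],
    "July":      [112.5, 116.1, 128.2, 104.7, 109.8],
    "August":    [107.9, 109.2, 114.0, 122.2, 117.8],
    "September": [110.3, 115.5, 124.6, 120.0, 117.0],
    "October":   [117.3, 122.0, 121.1, 124.2, 124.2],
    "November":  [111.4, 102.0, 110.5, 108.5, 107.8],
    "December":  [83.7, 93.1, 82.1, 78.7, 77.1],
}

def monthly_medians(table):
    """Median link relative for each month, rounded to one decimal."""
    return {month: round(median(vals), 1) for month, vals in table.items()}
```

With an even number of entries, as in January, `statistics.median` averages the two middle values, which is how the figure 80.7 arises.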


We now set up what are called chain relatives for the various
months. This is done by setting the first month arbitrarily
equal to 100.0, and determining the chain relative for any other
month by multiplying the median link relative of that month
by the chain relative of the preceding month and dividing by
100 (since the link relative is a percentage). This gives us
the following chain relatives:


January       100.0    July           75.6
February       82.5    August         86.2
March          69.5    September     100.9
April          63.6    October       123.1
May            63.6    November      133.6
June           67.2    December      109.7

                       January        88.5
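
The chain can be reproduced with a few lines of Python. This is a sketch under two assumptions made here: each chain relative is rounded to one decimal before it is carried forward (which is what reproduces the printed figures), and the names `MEDIAN_LINKS` and `chain_relatives` are illustrative.

```python
# Median link relatives, January through December, from the list above.
MEDIAN_LINKS = [80.7, 82.5, 84.3, 91.5, 100.0, 105.7,
                112.5, 114.0, 117.0, 122.0, 108.5, 82.1]

def chain_relatives(median_links):
    """Chain relatives for the 12 months plus the closing second January."""
    chain = [100.0]                      # first month set arbitrarily to 100
    # February..December use their own link; the closing entry uses January's.
    for link in median_links[1:] + median_links[:1]:
        chain.append(round(chain[-1] * link / 100.0, 1))
    return chain
```

The function returns 13 values; the last is the second January figure, which the following paragraph compares with 100.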


It will be noticed that we have carried the computation clear
around to include January for a second time, this second figure
for January being obtained, like any other chain relative, by
multiplying the month's link relative (80.7) by the chain relative
for the preceding month (109.7). If it were not for such things
as the influence of trend, the rounding off of numbers, and the
fact that we have used the median link relative rather than the
arithmetic mean, we should have come back to our original
100.0 with our second January. But things seldom work themselves
out so smoothly, and we are therefore usually obliged, as
we are here, to make some adjustment. Our final figure, 88.5,
is too low by 11.5 per cent. We shall add 1/12 of the discrepancy
(0.9583 per cent) to February, 2/12 of the discrepancy
(1.9167 per cent) to March, etc. This gives us the following
index adjusted for trend:


January      100.0    May        67.4    September    108.6
February      83.5    June       72.0    October      131.7
March         71.4    July       81.3    November     143.2
April         66.5    August     92.9    December     120.2


If the second January chain relative had been larger than 100, 
it would have been necessary to reduce the various months by 
their proportionate amounts, just as we increased them in this 
case because the second January was too small. 
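
The proportional adjustment can be sketched as follows. The function name `adjust_for_trend` and its argument layout (the 12 chain relatives followed by the closing second January) are assumptions made for this illustration. A positive discrepancy is added and a negative one subtracted, so the same code covers both cases described above.

```python
def adjust_for_trend(chain):
    """Spread the closing discrepancy evenly over the period.

    `chain` holds the s chain relatives followed by the closing one
    (the second January in the monthly case).
    """
    s = len(chain) - 1                 # number of subdivisions (12 months)
    d = 100.0 - chain[-1]              # discrepancy; negative if the chain overshoots
    # Month k (counting the first as 0) receives k/s of the discrepancy.
    return [value + k * d / s for k, value in enumerate(chain[:-1])]

chain = [100.0, 82.5, 69.5, 63.6, 63.6, 67.2, 75.6,
         86.2, 100.9, 123.1, 133.6, 109.7, 88.5]
adjusted = adjust_for_trend(chain)
```

The unrounded results agree with the table to the nearest tenth, apart from values such as July (81.35) that sit on a rounding boundary.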

It is now common, as a last step, to center the index of seasonal 
variation, so that the averages of the monthly indexes will be 
100. This is done by dividing each of the crude indexes in the 
preceding table by the average of all 12 monthly indexes. The 
average of the 12 indexes in the preceding table is 94.9, and if 
we divide each of the 12 crude indexes by 94.9 and then multiply 
by 100, we get the following final index of seasonal variation 
based on link relatives: 


January      105.4    May        71.0    September    113.4
February      88.0    June       75.9    October      137.7
March         75.2    July       85.7    November     150.9
April         70.1    August     97.9    December     116.1
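
The centering step is a single division, sketched here in Python. The names `ADJUSTED` and `centre_on_100` are illustrative; the sketch divides by the exact average rather than the rounded 94.9, so its output is not guaranteed to reproduce every printed entry.

```python
# Index adjusted for trend, January through December.
ADJUSTED = [100.0, 83.5, 71.4, 66.5, 67.4, 72.0,
            81.3, 92.9, 108.6, 131.7, 143.2, 120.2]

def centre_on_100(indexes):
    """Rescale the indexes so that their average is exactly 100."""
    average = sum(indexes) / len(indexes)
    return [100.0 * value / average for value in indexes]

centred = centre_on_100(ADJUSTED)
```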


The student will wish to compare the results obtained by 
this method with those given in Table 13.10, page 406, which we 
obtained by the moving-average method. While the two sets 
of figures are by no means identical, nevertheless they do show 
very definitely the same general sort of seasonal movement. 

Perhaps it would be wise to summarize the steps necessary 
in the link relative method, since the process is not so difficult 
as it may seem when the illustration has been run through so 
many pages. The steps involved are these: 

1. Convert the original data into link relatives, by dividing each entry
by the one which precedes it and multiplying the quotient by 100.

2. Sort out the link relatives by months, and compute the median (or
arithmetic mean) link relative for each month.

3. Compute a set of chain relatives, by setting the first chain relative
equal to 100, and finding each other chain relative by multiplying the link
relative for the period by the chain relative for the next preceding period.
Carry this process through to include the first unit of the next period (that
is, when dealing with monthly data, carry it through to include the next
January).

4. If the last chain relative computed in the preceding step is not 100,
adjust for trend by adding or subtracting a correction factor. If the final
chain relative is larger than 100, the correction factor is to be subtracted.
If the final chain relative is smaller than 100, the correction factor is to be
added. The first month is kept at 100, but the correction factor for the
next month is 1/12 the amount by which the last chain relative differs from
100, the correction factor for the third month is 2/12 of this amount, then
3/12, 4/12, etc. When working with other than monthly data, we can work
as follows. Let d be the difference between the last chain relative and
100. Let s be the number of subdivisions in our period. (Above it was
12 since there were 12 months in our period. For weekly cycles s might
be 7, etc.) Then the correction factors for the succeeding subdivisions are
d/s, 2d/s, 3d/s, . . . , sd/s.

5. Bring the final index to the level of 100 by finding the average of the 
indexes adjusted for trend in step 4, and then dividing each of these indexes 
by their average. 

The chain relative method is usually faster than the method
based on the moving average, and the results are usually reasonably
similar. There is little theoretical advantage of either
system over the other for ordinary cases.

13.9. The Elimination of Seasonal Movements.—So much for
the methods that are used to describe seasonal movements.
Let us now turn our attention to the problem of removing the
seasonal movement so that we can study the remaining characteristics
of the data without having them obscured by the seasonal
swings. As we saw in Fig. 13.5, page 399, the seasonal movements
in the prices of eggs are so pronounced that they hide the
other movements almost completely. In eliminating this seasonal
swing, we shall use the index of seasonal variation based
on the moving average, which is tabulated on page 406.

The simplest way to eliminate the seasonal movement is to
divide the actual price for each month by the index of seasonal
variation. This index is really a percentage, and the January
index of 107.8 can therefore be thought of as 107.8 per cent, or
1.078. We should divide all January figures by 1.078 to make
them somewhat smaller. The January figures are all too large
by 7.8 per cent because of the time of the season. They should
be reduced 7.8 per cent to be comparable with the figures for the
other months. Similarly the prices are always low in June
because of the time of year, and if a price is 75.3 per cent of
normal it is just where it belongs. If, then, we divide it by 0.753,
we make it comparable with the prices of the other months. If
we divide each month's price by the seasonal index for the corresponding
month, we get the prices corrected for seasonal variation
as shown in the following table:


Month          1919    1920    1921    1922    1923
January        66.9    78.0    70.6    52.0    52.9
February       66.6    83.4    61.9    58.4    56.0
March          65.8    81.0    59.0    52.1    60.4
April          76.4    79.3    55.8    55.8    57.3
May            76.3    76.3    47.5    54.6    57.5
June           74.4    74.4    51.7    57.1    54.4
July           75.6    78.0    60.0    54.0    54.0
August         69.4    72.5    58.2    56.1    54.1
September      64.9    70.9    61.4    57.0    53.6
October        61.4    69.8    60.0    57.2    53.7
November       64.3    66.9    62.4    58.4    54.5
December       63.8    73.9    60.7    54.5    49.8
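
The division that produces each entry of this table can be sketched in Python. The dictionary `SEASONAL_INDEX` copies Table 13.10; the function name `remove_seasonal` is introduced here for illustration. The June, 1920, price of 56 cents is the one quoted later in the chapter.

```python
# Index of seasonal variation from Table 13.10 (per cent).
SEASONAL_INDEX = {
    "January": 107.8, "February": 84.0, "March": 72.9, "April": 68.1,
    "May": 69.5, "June": 75.3, "July": 83.4, "August": 98.0,
    "September": 115.9, "October": 143.5, "November": 152.4, "December": 128.7,
}

def remove_seasonal(price, month):
    """Divide a price by the month's seasonal index, read as a decimal."""
    return price / (SEASONAL_INDEX[month] / 100.0)

print(round(remove_seasonal(56, "June"), 1))   # 74.4
```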


When these prices are plotted, we discover that the seasonal
fluctuations have been entirely eliminated but that the secular
and random movements are still present. In fact the latter
stand out much more clearly now than before. The prices with
seasonal eliminated are shown in Fig. 13.8, which should be compared
with Fig. 13.5 showing the prices in their original form.
Points on the chart in Fig. 13.8 which show high prices mean that
the prices were high for the time of year in question. A price
which is “high” for April might be a “low” price for November.
On this chart we have adjusted all prices for the seasonal variation
which usually occurs, and the variations which are left are
variations from the usual seasonal position. It is common to
speak of such prices as prices which have been corrected for seasonal
variation, or to speak of them merely as prices with seasonal
eliminated.

To summarize, we eliminate the seasonal from data after the
index of seasonal variation has been found by dividing the original
data for any month by the seasonal index of that month (remembering
that the seasonal index is a percentage and pointing off
two places accordingly).

13.10. Random Movements.—We have now found how to
describe and to remove both secular and cyclical movements. If
we have done the work accurately, there should be no movements
left in the data which are coordinated with the passage of time.
To be sure, we shall not have removed all variations from the
data. There will still be movements reflecting the differences in
the quality of eggs, or the quantity shipped to market, or changes
in consumer tastes, or changes in the prices of substitute commodities,
etc. We have not tried to eliminate all the movement
in egg prices, but merely those which show some temporal regularity.
The movements which are left should show the effects of
changes in nontemporal forces.

Fig. 13.8. Monthly egg prices, 1919-1923, with seasonal eliminated.
Data from page 412.

Inspection of Fig. 13.8 shows that we have eliminated fairly
well the seasonal swings, but we still have the secular trend left.
The chart shows a rather definite and fairly linear downward
trend of prices. We could fit a straight line to the data of Fig.
13.8 by the method of least squares, but let us eliminate the trend
by the easier method of a freehand line. If the student will
stretch a string or hair, or lay a transparent ruler over the figure,
shifting it until it shows the general direction of the trend, he will
see that the trend line crosses the left-hand vertical axis at a point
which represents approximately 70 cents, and crosses the right-hand
axis at approximately 52 cents. Thus the drop for the
entire period is 18 cents. A drop of 18 cents in 60 months is a
drop of 0.3 cents per month. We note, then, that our figures are
too high by 9 cents in January, 1919. We subtract 9 cents for
that month, 0.3 of a cent less than 9 cents for the next month,
2(0.3) of a cent less than 9 cents for the following month, etc. In
other words, the amounts that we subtract from the figures in the
table on page 412 for successive months starting with the first one
are 9.0, 8.7, 8.4, 8.1, 7.8, etc. After we pass the middle of the
period, we shall be adding first small fractions of a cent and then
more and more until at the end we add 9 cents. This will give us
the data of Table 13.12, which contains egg prices with both secular
and seasonal movements eliminated.


Table 13.12.—Monthly Egg Prices, 1919-1923, with Both Secular and
Seasonal Movements Eliminated

Month          1919    1920    1921    1922    1923
January        57.9    72.6    68.8    53.8    58.3
February       57.9    78.3    60.4    60.5    61.7
March          57.4    76.2    57.8    54.5    66.4
April          68.3    74.8    54.9    58.5    63.6
May            68.5    72.1    46.9    57.6    64.1
June           66.9    70.5    51.4    60.4    61.3
July           68.4    74.4    60.0    57.6    61.2
August         62.5    69.2    58.5    60.0    61.6
September      58.3    67.9    62.0    61.2    61.4
October        55.1    67.1    60.9    61.7    61.8
November       58.3    64.5    63.6    63.2    62.9
December       58.1    71.8    62.2    59.6    58.5
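
The freehand-trend correction that leads from the seasonally corrected prices to Table 13.12 can be sketched as below. The function name `remove_linear_trend` and its parameter names are assumptions made for this illustration; the data are the 1919 column of the seasonally corrected table.

```python
def remove_linear_trend(values, first_excess=9.0, monthly_drop=0.3):
    """Subtract a straight-line trend measured from its mid-period level.

    The first month is taken to be `first_excess` above the mid-period
    level, and the excess shrinks by `monthly_drop` each month, so the
    correction eventually changes sign and becomes an addition.
    """
    return [round(v - (first_excess - monthly_drop * k), 1)
            for k, v in enumerate(values)]

# Seasonally corrected prices for 1919, January through December.
corrected_1919 = [66.9, 66.6, 65.8, 76.4, 76.3, 74.4,
                  75.6, 69.4, 64.9, 61.4, 64.3, 63.8]
detrended_1919 = remove_linear_trend(corrected_1919)
```

To continue into 1920 and later years the month counter `k` must keep running (12, 13, ...) rather than restart at zero, for example by passing all 60 values in one list.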


When we select any figure from Table 13.12, say the figure 70.5
cents for June, 1920, we have to realize that this does not mean
that the price of eggs in June, 1920, was 70.5 cents. As a matter
of fact, reference to Table 13.8 will show that the actual price was
56 cents. Table 13.12 tells us that, according to our best estimate,
the price in June, 1920, would have been 70.5 cents if the
secular and seasonal movements had not been present. June egg
prices are only about 75.3 per cent of the annual average on
account of the regular seasonal swing (see index of seasonal variation,
Table 13.10), so eliminating the seasonal we estimate a price
of 56/0.753, or 74.4 cents. But the secular trend is high in June,
1920, and requires a reduction of 3.9 cents, giving us a price, corrected
for both trend and seasonal, of 70.5 cents as we saw in
Table 13.12. The figures of this table are charted in Fig. 13.9.
Although there is a noticeable similarity between Figs. 13.9 and
13.8, we note at once that the secular movement has been
removed in the new chart.

Fig. 13.9. Monthly egg prices, 1919-1923, with seasonal and secular movements
eliminated. Data from Table 11.8.

Suppose, now, that a research man is interested in finding what
relationship there is between the quality of eggs and their price.
He goes out on the market month after month and candles the
eggs to find their quality. He notices the prices at which they
sell. When he discovers that the price in November, 1920, was
$1.02 per dozen, while in May of the same year it had been only
53 cents per dozen (figures from Table 13.8), he might conclude
that the higher November price reflected higher quality of eggs.
But when he looks at the figures of Table 13.12, he discovers
corrected prices of 64.5 cents for November and 72.1 cents for
May, 1920. The May prices were actually considerably higher
than the November prices after allowance is made for time factors.
The corrected figures of Table 13.12 should be far more useful to
this investigator than the actual prices shown in Table 13.8.
This is a good example of a case where one has studied the secular
movement and the seasonal movement, not because he is interested
in them per se, but because he wants to eliminate them
and study the relationship between other factors (like quality)
and the random or residual movements which are left. The
random movements of Fig. 13.9 are not at all noticeable in Fig.
13.5, page 399, which shows the actual prices. The secular and
especially the seasonal movements in that chart obscure everything
else. But Fig. 13.9 shows immediately and strikingly the
movements in the original data which were not arranged in some
temporal pattern—which were not explainable in terms of the
passage of time.

13.11. The Concept of the Statistical Normal.—Having
described and “eliminated” the secular and cyclical movements,
we may now wish to reconstruct them, omitting the random
movements. Such a reconstructed idealized series, made up of
the secular and cyclical movements, but omitting the random
movements, may be thought of as showing the changes that we
might have expected to have occurred under “normal” conditions,
in the absence of temporary and sporadic forces.

Suppose, for example, that you are asked, “What number of
automobiles might one expect to pass the junction of U.S. Route
11 and U.S. Route 20 in an hour?” Your answer would have to
depend on what hour was being considered. The traffic at 2 a.m.
and the traffic at 2 p.m. may well be different. A “normal” rate
for one time would not be “normal” for another. And in addition
to this diurnal cycle, there is an annual cycle, with traffic on
Labor Day “normally” different from traffic on “Groundhog’s
Day.” There is likewise a weekly cycle, with Sunday traffic
“normally” different from Wednesday traffic. And in addition
there is a secular trend, with the “normal” traffic for 1950 far
different from the “normal” traffic for 1910, even if we pick the
same month, day, and hour of the year.

Our secular trend might tell us that the basic hourly traffic
was 42 cars per hour in 1940, with an increase of 8 cars per hour
each year thereafter. We could put this in the form of a trend
equation, from which we could estimate the “normal” traffic for
any ensuing year, with 1940 as the origin. Thus, for 1954
the “normal” traffic based on secular forces alone would be
42 + 14(8), or 154 cars per hour. But if we are interested in
knowing the facts for Sept. 28, our seasonal index may tell us that
the traffic on this day of the year is 114 per cent of that for the
normal day of the year. Therefore we should expect not 154
cars per hour, but 114 per cent of 154 cars; that is, we should
expect 176 cars per hour. But this is for Sept. 28 on the average.
If Sept. 28, 1954, falls on a Tuesday, and if our studies show that
Tuesday traffic is but 78 per cent of “normal” traffic for the
week, then we should expect not 176 cars per hour (as on a typical
Sept. 28), but 78 per cent of 176 cars; that is, we should expect
137 cars. And finally, if the hour which we choose on Sept. 28,
1954, is the hour from 4 to 5 o'clock in the morning, we realize
that we should not use the average traffic figure for the day. Our
diurnal index may indicate that the traffic at this hour is but 25
per cent of the average traffic for the day. In that case we shall
expect, not 137 cars in our hour, but 25 per cent of 137 cars; that
is, we shall expect 34 cars.
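
The successive corrections in this example can be followed in a short Python sketch. The function name `normal_traffic` and its parameter names are introduced here for illustration; the figures are the hypothetical ones of the text, and each stage is rounded to a whole number of cars before the next index is applied, as the text does.

```python
def normal_traffic(year, seasonal_pct, weekday_pct, hourly_pct,
                   base=42, base_year=1940, yearly_gain=8):
    """Return the estimate after the trend and after each cycle in turn."""
    trend = base + yearly_gain * (year - base_year)    # secular trend only
    seasonal = round(trend * seasonal_pct / 100)       # day-of-year index
    weekly = round(seasonal * weekday_pct / 100)       # day-of-week index
    hourly = round(weekly * hourly_pct / 100)          # hour-of-day index
    return trend, seasonal, weekly, hourly

print(normal_traffic(1954, 114, 78, 25))   # (154, 176, 137, 34)
```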

We have now come down to making an estimate for the particular
hour of 4 to 5 a.m. on Tuesday, Sept. 28, 1954. Our estimate
of 34 cars is based on the long-run secular growth in traffic, and
on the monthly, weekly, and hourly cycles. When we say that
the “normal traffic” is 34 cars an hour, we are indicating that we
should expect such an amount of traffic if we consider only those
forces which work smoothly and regularly through time. Yet we
do not mean that there must be exactly 34 cars passing the intersection
in this hour, for there are many forces other than time-connected
forces, which we have not considered at all. A heavy
rain, an unusually early ice storm, imposition of restrictions on
the sale of gasoline or tires, a resurfacing job on the road which
detours traffic—these and numberless other “random” forces may
come in to upset our calculations. When we say that the
“normal” traffic for the particular year, day, and hour is 34 cars
per hour, we realize that the actual traffic may fall short of this
figure or may greatly exceed it.

These ideas can be expanded to cover any other statistical
“normal.” When we talk of “normal” department-store sales
over the Christmas holidays, or “normal” temperatures on the
Fourth of July, or a “normal” price of eggs, or a “normal” yield
of hay, or a “normal” number of absences from school—in each
of these cases we are setting up by more or less formal statistical
means some “expected” value to be used for purposes of comparison.
But in computing our “normal” we never include all the
forces that may affect the value in question. If we did include
all the forces, then there could never be anything “abnormal”—
we should always hit the nail exactly on the head. The very
concept of “normality” implies its counterpart, abnormality.
When people say, as they sometimes do, that there never is a time
that is really normal, they are not, as they often seem to think,
proving that the concept of normality is useless. The “normal”
occurrence is not what happens, but what would have happened
if there had been no unusual transitory forces at work to make the
result abnormal. And in any field our idea of this abstract,
hypothetical, idealized “normal” is made up by combining the
effects of the various sorts of time-connected forces (secular,
seasonal, diurnal, etc.), neglecting the “random” forces.

13.12. Suggestions for Further Reading.—Several of the references mentioned
in Sec. 12.15, page 382, contain matter on time series in general and,
hence, are applicable to cyclical as well as to secular movements. Karl G.
Karsten, “Charts and Graphs,” Prentice-Hall, Inc., New York, 1923, gives a
simple exposition in Chap. XXI. For dealing with cycles other than the
12-month seasonal cycle, the student may wish to investigate what is known
as periodogram analysis. A short treatment of the method is given in
Harold T. Davis and W. F. C. Nelson, “Elements of Statistics with Applications
to Economic Data,” pp. Principia Press, Bloomington, Ind.,
1935. The method is criticized adversely in the Journal of the American
Statistical Association, Vol. 18, p. 889; and Vol. 22, p. 289. Many specialized
books in the field of business statistics contain discussions of the
so-called “business cycle” and its treatment. In this field the student
should see particularly Wesley C. Mitchell, “Business Cycles, The Problem
and Its Setting,” National Bureau of Economic Research, Inc., New York,
1927, particularly Chap. III. Henry L. Moore, “Economic Cycles: Their
Law and Cause,” The Macmillan Company, New York, 1914; and William
L. Crum, Periodogram Analysis, in “Handbook of Mathematical Statistics,”
edited by Henry L. Rietz, Houghton Mifflin Company, Boston, 1924, are
both authoritative and helpful. For a treatment of the problem of random
movements, see Gerhard Tintner, “The Variate Difference Method,”
Principia Press, Bloomington, Ind., 1940.

EXERCISES 

1. What are the principal advantages of the moving average for showing
secular changes? How does it differ from the progressive mean?

2. Find at least two historical series showing distinct seasonal movements.
Find two others containing cyclical movements of a nonseasonal character.

3. On page 406 we took the median of the figures in the table for each
month. Some authors prefer to take the arithmetic mean each month.
How much would the index of seasonal variation have been altered if we had
followed this other procedure? Compute the index on the latter basis for
purposes of comparison.

4. Compute an index of seasonal variation of New York City temperatures,
using the basic data of Table 13.4, page 394. Use the moving average
method described in Sec. 13.7.

5. Compute the index called for in the preceding exercise, but use the
link relative method described in Sec. 13.8.

6. Eliminate the seasonal movement from the data of Table 13.4, page
394, using as a basis your index of seasonal variation computed in one of the
two preceding exercises. Interpret your “corrected” data.

7. Annual and diurnal cycles have such an obvious astronomical basis
that it is natural to expect to find cycles of these lengths in the data of
almost every science. In how many separate sciences can you find illustrations
of annual cycles? Diurnal cycles?

8. The weekly cycle has less natural basis than the annual or the diurnal
cycle, yet the week has become firmly enough imbedded in our habits so
that weekly changes appear in many kinds of data. In how many separate
sciences can you find evidences of weekly cycles?

9. What is the amplitude of the cycle of egg prices in Table 13.8, page 404?
Use as your basis of measurement one of the indexes of seasonal variation
computed in the chapter.

10. Seasonal movements are by no means confined to prices. Table
13.13 gives figures¹ showing the number of strikes beginning each month
from 1927-1936. Compute the index of seasonal variation by the link
relative method.


Table 13.13.—Number of Strikes Beginning Each Month, 1927-1936

Year      Jan.  Feb.  Mar.  Apr.   May  June  July  Aug. Sept.  Oct.  Nov.  Dec.
1927        35    63   ...    84    95   ...    55    56    58    50    28    33
1928        45    46    41    69    80    44    56    53    48    60    37    25
1929        50    51    68   121   121    77    81    86    99    73   ...    34
1930        49    49    47    68    58    61    79    53    68    42    36    27
1931        58    52   ...    78   104    66    67    78    81    68    57    48
1932        88   ...    63    89    91    74    72    89    86   ...    43    36
1933        83    67   ...    89   161   154   237   261   233   145    87    72
1934        98    94   161   210   226   165   151   183   150   187   ...   101
1935       140   149   175   180   174   189   184   239   162   190   142   ...
1936       167   148   185   183   206   188   173   228   234   192   136   132
Totals     813   ...   969  1171  1316  1098  1155  1326  1219  1057   756   598


11. Are the data of Table 13.13 of a type such that it might be wise to 
correct them for calendar variation? If so, make the necessary corrections. 

1 From Dale Yoder, Seasonality in Strikes, Journal of the American 
Statistical Association, Vol. 33, No. 204, December, 1938, p. 687. 

















CHAPTER XIV 


INDEX NUMBERS 

The use of index numbers has been pretty largely confined, in
practice, to the fields of economics and business; yet the applicability
of index numbers seems to be general enough so that there
should be some gain in applying them more widely in other fields.
To a very considerable extent, index numbers have been used to
compare situations at different periods of time; yet they can also
be used for making comparisons of different geographical areas,
different business units, or almost any other sets of categories.
Theoretically, index numbers can be used as broadly and as
generally as any of the other statistical measures that we have
treated, and the fact that their use to date has been pretty largely
confined to a few fields of science should not preclude a consideration
of them even in a general, nonspecialized textbook.

Within the fields of economics and business, index numbers
have been used to make many kinds of comparisons: comparisons
of prices, of volumes of business, of costs of production, of
employment, of wages, of volume of output, of buying power, of
living costs, etc. Historically the first, and still the most common,
use of index numbers is in the making of comparisons of
prices at different times. Of course, for a simple comparison of
the prices of a single commodity at different times we do not need
an index number. If we are told that a pair of shoes cost 50
cents¹ in 1805 and $5 in 1905, we can make the comparison
directly without the computation of any complicated statistical
coefficients. To be sure, we would still have to satisfy ourselves
that the shoes were of comparable quality, but the problem would
obviously be far simpler than it would be if we asked what had
happened to the cost of living in general between 1805 and 1905.
In the latter case, we should doubtless find that during the
century the prices of some things had risen, some fallen, and some
stayed about the same. It would no longer be possible merely to
compare two simple figures. As soon as we get to making
comparisons between complicated things, such as the cost of
living or the prices of farm products, we need some sort of
statistical help.

¹ This is the actual average cost taken from an old family account book
for a large family in upper New York State.

14.1. A Simple Aggregative Index Number.—Suppose you
have a list of the prices of 20 commodities in 1930, and another
list showing the price of each of these 20 commodities in 1935.
You note that some of the commodities have risen in price and
some have fallen in price, and you are interested in knowing
whether prices in general have risen or fallen. Of course, you
might count the cases in which prices had risen and the cases in
which prices had fallen, and if you found 14 increases and 6
decreases you might conclude that prices had in general risen.
Yet the 14 increases might be small and the 6 decreases large, in
which case it is quite possible that the decreases would be more
than enough to offset the increases. It is obvious that what you
will need is some single summary figure that will characterize
these increases and decreases. This is exactly the same problem
that we faced in computing averages, and it is handled in approximately
the same way.

Let us first take a hypothetical case. We shall assume that 
there are five commodities: A, B, C, D, and E. They are of 
equal importance. Their prices in 1930 and in 1935 are given 
in Table 14.1. 

Table 14.1.—Hypothetical Price Data 

Commodity    1930 Price    1935 Price 
A              $1.00         $1.17 
B               0.40          0.30 
C               0.60          0.66 
D               1.12          1.12 
E               0.60          1.20 

Inspection will show that three of these commodities have risen 
in price (A, C, and E). One has not changed in price (D), and 
one has fallen in price (B). How can we get a single summary 
figure that will tell us the amount of the change in price? 

We could add the two columns of figures and find whether it 
takes more or less than before to buy one unit of each com¬ 
modity. This would give us the sum of $3.72 for 1930 and of 
$4.45 for 1935. In other words, it took somewhat more money 
to buy this bill of goods in 1935 than in 1930. We might well 
express the change as a percentage, letting the sum of the prices 
in 1935 be stated as a percentage of the 1930 sum. In this case 
we should say that when the 1930 prices are considered as 100 per 
cent, the 1935 prices are $4.45/$3.72 = 119.5 per cent. More 
commonly we should say merely that the 1935 index number 
on a 1930 base is 119.5. We see, then, that the base of an index 
number is the period that is taken as the basis for comparison. 
We let the prices of the base period be 100 per cent and compute 
the relation of the prices of other years to the prices in this base 
period. Instead of saying that the base is 1930, one would 
usually give the index numbers for the various years with the 
statement, “1930 = 100.” We see also from our simple example 
that an index number shows the percentage by which a group of 
values taken at one time or place differs from another group of 
values taken at another time or place. The index number which 
we have just computed is called the simple aggregative index 
number. Such index numbers are computed by adding the 
values for each year and stating the sum for each year as a per¬ 
centage of the sum in the base year. 
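
The computation just described can be sketched in a few lines. The following is a minimal illustration only, assuming Python as the computing language (the text prescribes none); the names `prices_1930`, `prices_1935`, and `simple_aggregative_index` are hypothetical labels introduced here. The prices are those of Table 14.1, and the 1935 result agrees with the figure in the text to within rounding in the last digit.

```python
# Prices from Table 14.1, in dollars; one unit of each commodity is priced.
prices_1930 = {"A": 1.00, "B": 0.40, "C": 0.60, "D": 1.12, "E": 0.60}
prices_1935 = {"A": 1.17, "B": 0.30, "C": 0.66, "D": 1.12, "E": 1.20}


def simple_aggregative_index(given, base):
    """Sum of the given-period prices as a percentage of the base-period sum."""
    return 100.0 * sum(given.values()) / sum(base.values())


# The base year is 100 by definition; 1935 is stated relative to it.
index_1930 = simple_aggregative_index(prices_1930, prices_1930)
index_1935 = simple_aggregative_index(prices_1935, prices_1930)
print(f"1930 = {index_1930:.1f}, 1935 = {index_1935:.1f}")
```

Note that the result depends on the units in which each price is quoted, since the prices are simply added.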

14.2. Averages of Relatives. —Another summary figure could 
be obtained by stating the price of each commodity in 1935 as a 
percentage of the price in 1930, and then taking some average of 
these percentage figures. If we convert the figures of Table 14.1 
to percentages of the 1930 price, we get Table 14.2. 

Table 14.2.—Relative Prices from Table 14.1, 1930 = 100 

             Relative Prices 
Commodity    1930    1935 
A             100     117 
B             100      75 
C             100     110 
D             100     100 
E             100     200 

Each figure in this table is a percentage showing the price of the 
particular commodity in each year relative to the price in 1930. 
Such figures are called relatives, and since these are based on 
prices they are called price relatives or relative prices. Any number 
is a relative if stated as a percentage of some other number. 

If we take the arithmetic mean of the price relatives for 1930, 
we obviously have 100 for an answer. The mean of the five 
relatives of 1935 is 120.4. We can say that the index number of 
prices is 120.4. This would be an index number based on the 
mean of the relatives. We might as well have computed the 
index number based on the median of the relatives or the geometric 
mean of the relatives or the harmonic mean of the relatives. Such 
methods are commonly used in index-number work. Our results 
by these methods would be as follows: 


Median relative                  110 
Geometric mean of relatives      114.0 
Harmonic mean of relatives       108.9 
Mean of relatives                120.4 
Aggregative                      119.5 


The last two figures are from our earlier computations. 

It will be seen, then, that we have computed the index number 
by five methods and have found five different answers, ranging 
from a high of 120.4 to a low of 108.9. Each of these answers 
purports to show the percentage, on the whole, which 1935 
prices are of 1930 prices. All are based on the same figures; 
yet no two agree. This is not surprising, since we discovered 
when we were studying averages that the mean, the harmonic 
mean, and the geometric mean always (unless the values averaged 
are identical in size) differ. 1 
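
The four averages of the relatives can be checked with a short sketch. This is an illustration under stated assumptions: Python and its standard `statistics` module are choices made here, not methods named in the text, and `relatives_1935` and `averages` are hypothetical names. The relatives are those of Table 14.2; the results agree with the text's figures to within a small rounding difference in the last digit.

```python
from statistics import geometric_mean, harmonic_mean, mean, median

# 1935 price relatives from Table 14.2 (1930 = 100).
relatives_1935 = [117, 75, 110, 100, 200]

# Each average of the relatives is a candidate index number for 1935.
averages = {
    "median": median(relatives_1935),
    "geometric mean": geometric_mean(relatives_1935),
    "harmonic mean": harmonic_mean(relatives_1935),
    "arithmetic mean": mean(relatives_1935),
}

for name, value in averages.items():
    print(f"{name:16s} {value:6.1f}")
```

The ordering harmonic < geometric < arithmetic holds whenever the relatives are positive and not all equal.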

14.3. Bias in Index Numbers.—Again we see that the arithmetic 
mean is larger than the geometric mean, and that the 
harmonic mean is smaller than either. With these different 
results, which are we to use? Suppose that we selected a very 
large number of commodities for study. We found their prices in 
the base year, and then a year later we found the prices again. 
Suppose, moreover, that there had been no general change at all 
in price level. Individual commodities had changed in price, to 
be sure, but at random. There had been no movement down¬ 
ward and no movement upward in general. What should we 

1 See p. 109. 
find at the end of the year? Obviously some of the price relatives 
would be larger than 100 and some smaller than 100. It is 
probably reasonable to assume that if the prices are changing at 
random (there being no fixed movement upward or downward) a 
price is as likely to double as to be cut in half. A commodity that 
doubled in price should just balance one that fell to half its former 
price. But note that the price relatives of these two commodities 
would be 200 and 50; that is, 200 for the one that doubled and 50 
for the one that was cut in half. The arithmetic mean of these 
two relatives is $(200 + 50) \div 2 = 125$. Instead of showing no 
change, the arithmetic mean of the relatives shows a rising price. 

Let us, then, try the harmonic mean. The harmonic mean 
of the two numbers is $2 \div (\tfrac{1}{200} + \tfrac{1}{50}) = 80$. By this method 
we find that prices have been falling; yet we know that there 
have been merely chance changes of price with no real rise or 
fall. When we try the geometric mean of the two prices, we find 

$$\sqrt{(200)(50)} = \sqrt{10{,}000} = 100$$ 

This method, then, shows neither a rise nor a fall. In other 
words, the geometric mean gives a true picture of the situation. 
Rates of change can properly be averaged only by the geometric 
method. For this reason there is a decided advantage in using 
the geometric mean of the relatives in computing any simple 
index number. 
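
The argument can be tried on other offsetting pairs besides 200 and 50. The sketch below is illustrative only; it assumes Python's standard `statistics` module, and the function name `averages_of_offsetting_pair` is hypothetical. For each factor, one relative is 100 multiplied by the factor and the other is 100 divided by it, so that the two changes should balance.

```python
from statistics import geometric_mean, harmonic_mean, mean


def averages_of_offsetting_pair(factor):
    """Average two relatives: one price multiplied by `factor`, one divided by it."""
    pair = [100.0 * factor, 100.0 / factor]
    return mean(pair), geometric_mean(pair), harmonic_mean(pair)


# factor = 2.0 reproduces the relatives 200 and 50 used in the text.
for factor in (1.25, 2.0, 4.0):
    arithmetic, geometric, harmonic = averages_of_offsetting_pair(factor)
    print(f"factor {factor}: arithmetic {arithmetic:.1f}, "
          f"geometric {geometric:.1f}, harmonic {harmonic:.1f}")
```

In every such pair the geometric mean stays at 100, while the arithmetic mean lies above it and the harmonic mean below, and the gap widens as the factor grows.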

14.4. Weighting of Index Numbers.—Now let us face another 
problem. Going back to the price changes of the five commodi¬ 
ties tabulated on page 421, we may well ask if these commodities 
differ in importance. If milk rises in price by 2 cents per quart, 
the effect on the family budget is much greater than is an increase 
of 5 cents each on hairbrushes. In fact, a 5 per cent increase in 
the price of milk is much more important to consumers than a 
30 per cent increase in the price of hairbrushes. Yet with the 
methods we have been using they would be given equal weight in 
the index. If milk prices rose 2 per cent and hairbrush prices fell 
2 per cent, the mean of the price relatives would show no change— 
yet consumers would feel that prices had risen. To them the 
increased price of milk is not offset by the lowered price of hair¬ 
brushes. Obviously the thing for us to do is to take a weighted 
average of the price relatives rather than a simple average. We 
could do this by counting the milk price 10 times and the hairbrush 
price once. It is simpler merely to multiply the milk price 
by some figure that represents its importance, the hairbrush price 
by some figure that represents its importance, etc. Thus for each 
commodity we shall have a weight which indicates the importance 
of the commodity. If our index number is to be one of the cost of 
living, we may well weight commodities according to their 
importance in the budget. 

Let us suppose that the relative importance of the five com¬ 
modities in our price table is as follows: 

A 2 

B 20 

C 1 

D 5 

E 1 

Thus commodity B is 10 times as important as commodity A, 
20 times as important as either commodity C or commodity E, 
and 4 times as important as commodity D. Let us multiply 
the price relatives from Table 14.2, page 422, by these weights 
and divide the sum of the products by the sum of the weights to 
get the weighted arithmetic mean of the relatives. This procedure 
gives the figures in Table 14.3. 

Table 14.3.—Computation of Weighted Average Index of Relatives 

Commodity    Price Relative    Weight    Product 
1930 
A                 100              2        200 
B                 100             20       2000 
C                 100              1        100 
D                 100              5        500 
E                 100              1        100 
1935 
A                 117              2        234 
B                  75             20       1500 
C                 110              1        110 
D                 100              5        500 
E                 200              1        200 

The total of the weights is 2 + 20 + 1 + 5 + 1 = 29. The total 
of the products for the year 1930 (the sum of the five figures in 
the first group of the last column in the table) is 2900. If we 
divide the latter figure by the former, we get $2900/29 = 100$, the 
index number of the base period. In practice it would not be 
necessary to compute the index number for this year, since it is 
100 by definition. If we take next the products for the year 1935, 
we find that they add to 2544. The index number for this year is, 
then, 

$$\frac{2544}{29} = 88$$ 

This is the weighted average index of the relatives as contrasted 
with the simple average index computed before. 
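
The weighted arithmetic mean of the relatives can be sketched as follows. This is a minimal illustration, assuming Python; the function name `weighted_mean_of_relatives` and the variable names are hypothetical. The relatives and weights are those given in the text for commodities A to E.

```python
# 1935 price relatives (1930 = 100) and the weights assumed in the text.
relatives_1935 = {"A": 117, "B": 75, "C": 110, "D": 100, "E": 200}
weights = {"A": 2, "B": 20, "C": 1, "D": 5, "E": 1}


def weighted_mean_of_relatives(relatives, weights):
    """Sum of relative-times-weight products divided by the sum of the weights."""
    total_products = sum(relatives[c] * weights[c] for c in relatives)
    return total_products / sum(weights.values())


index_1935 = weighted_mean_of_relatives(relatives_1935, weights)
print(f"weighted mean of relatives, 1935: {index_1935:.1f}")
```

Because commodity B carries 20 of the 29 units of weight, its fall to 75 dominates the result.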


Table 14.4.—Computation of Weighted Geometric Average of Relatives 

(1)          (2)         (3)               (4)       (5) 
Commodity    Relative    Log of            Weight    (3) × (4) 
             Price       Relative Price 
A            117         2.06819             2        4.13638 
B             75         1.87506            20       37.50120 
C            110         2.04139             1        2.04139 
D            100         2.00000             5       10.00000 
E            200         2.30103             1        2.30103 
Totals                                      29       55.98000 


The most noticeable feature of the present index as compared 
with those we computed before (see page 423) is that all the 
others gave values over 100, whereas this gives 88. The reason 
is easy to see. Before we used no weighting; we pretended that 
the commodities were of equal value. Yet really commodity B, 
the only commodity which fell in value, was far more important 
than all the others together. It is proper that it should have 
more influence on the result. We have now given it influence 
commensurate with its importance, and as a result the entire 
index has fallen. We can say, then, that prices are now but 
88 per cent of their 1930 value, and we shall be more nearly 
correct than before. For if the various commodities are given 
their proper weighting, the change in price has the same effect 
that a uniform drop of 12 per cent in all prices would have had. 

We could, of course, take the weighted geometric mean of the 
price relatives. This would involve raising each price relative to 
a power equal to the weight of the commodity, multiplying 
together these figures for all the commodities, and taking a root 
equal to the sum of the weights. The work would be done by 
logarithms, and we can illustrate it by computing the weighted 
geometric mean of the relatives for the year 1935 (see Table 14.4). 
The logarithm of the index number is 55.98000/29 = 1.93034. 
This gives us a value of 85.2 for the index number, which is 
again lower than the weighted arithmetic mean. 
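
The logarithmic computation can be sketched in code. This is an illustration only, assuming Python and its standard `math` module; the names `log_sum` and `weighted_geometric_index` are hypothetical. The procedure follows the text: each common logarithm is multiplied by its weight, the products are summed, and the sum is divided by the total weight before taking the antilogarithm.

```python
import math

# 1935 price relatives (1930 = 100) and the weights assumed in the text.
relatives_1935 = {"A": 117, "B": 75, "C": 110, "D": 100, "E": 200}
weights = {"A": 2, "B": 20, "C": 1, "D": 5, "E": 1}

# Sum of weight times common logarithm of each relative.
log_sum = sum(weights[c] * math.log10(relatives_1935[c]) for c in relatives_1935)
total_weight = sum(weights.values())

# The antilogarithm of the weighted mean logarithm is the index number.
weighted_geometric_index = 10 ** (log_sum / total_weight)
print(f"sum of weighted logs: {log_sum:.5f}")
print(f"weighted geometric mean of relatives: {weighted_geometric_index:.1f}")
```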

Likewise we may wish to weight the aggregative index number. 
We do this by multiplying the price of each commodity by its 
weight and adding the products. We then divide 100 times 
the sum for any given year by the sum in the base year, and the 
quotient is the required weighted aggregative index number. 
If we illustrate with the data we have just used, we get 


(1)          (2)       (3)       (4) 
Commodity    Price     Weight    Product 
1930 
A            $1.00       2       $ 2.00 
B             0.40      20         8.00 
C             0.60       1         0.60 
D             1.12       5         5.60 
E             0.60       1         0.60 
Total                            $16.80 
1935 
A            $1.17       2       $ 2.34 
B             0.30      20         6.00 
C             0.66       1         0.66 
D             1.12       5         5.60 
E             1.20       1         1.20 
Total                            $15.80 

To find the weighted aggregative index number for 1935 on a 
1930 base: 

$$\frac{100(15.80)}{16.80} = 94.1$$ 



This, we see, is higher than either of the other two weighted 
index numbers which we have computed from these data. 
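
The weighted aggregative computation can be sketched as follows. This is a minimal illustration, assuming Python; `weighted_aggregate` and the other names are hypothetical. The prices are those of Table 14.1 and the weights are those assumed in the text. The index agrees with the figure in the text to within rounding in the last digit.

```python
# Prices in dollars and the weights assumed in the text.
prices_1930 = {"A": 1.00, "B": 0.40, "C": 0.60, "D": 1.12, "E": 0.60}
prices_1935 = {"A": 1.17, "B": 0.30, "C": 0.66, "D": 1.12, "E": 1.20}
weights = {"A": 2, "B": 20, "C": 1, "D": 5, "E": 1}


def weighted_aggregate(prices, weights):
    """Sum of price-times-weight products for one year."""
    return sum(prices[c] * weights[c] for c in prices)


base_total = weighted_aggregate(prices_1930, weights)
given_total = weighted_aggregate(prices_1935, weights)

# 100 times the given-year sum, divided by the base-year sum.
index_1935 = 100.0 * given_total / base_total
print(f"1930 total {base_total:.2f}, 1935 total {given_total:.2f}, "
      f"index {index_1935:.2f}")
```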

14.5. Weight Bias.—We saw in Sec. 14.3 that, with chance 
variation among the relatives, the geometric mean yields an index 
of 100, but that the result by the arithmetic method is too high 
and that by the harmonic method too low. We say that the 
arithmetic method is characterized by an upward type bias and the 
harmonic method by a downward type bias; that is, one of them 
tends to give values which are too large and the other values 
which are too small. For this reason any simple (unweighted) 
index can be computed by the geometric method to advantage. 
But when we come to adding weights, as we have done in the 
more recent examples, we encounter the new difficulty that bias 
may arise in our index numbers from the weights which are used. 
Index numbers are commonly computed with fixed weights, as in 
our examples. That is, the weights for 1930 were the same as 
those for 1935; and, if we computed indices for other years, we 
should still use these same fixed-year weights. 

It is possible, of course, to use as weights values which change 
from year to year, using in each year a figure that shows the 
importance of the commodity in that year. Such weights are 
called given-year weights. Now fixed-year weights introduce a 
downward weight bias, and given-year weights introduce an 
upward weight bias in index numbers. If we start with the 
unbiased geometric method and introduce weights, we bias our 
results. If we use the arithmetic method with its upward-type 
bias and combine it with fixed-year weights (which have a 
downward weight bias), we overcome to some extent the type 
bias with the weight bias. If we use the harmonic method 
in conjunction with given-year weighting, we overcome the 
downward type bias to some extent by the upward weight 
bias. 

Our illustration has included but five commodities. If we 
included a large number of commodities, we might choose as 
our index number the median or the modal price relative. We 
have seen that the mode is hard to determine unless there are 
enough cases so that they can be easily grouped, and that it is not 
always clearly marked even then. For this reason indices are 
seldom based on the mode. The median has the advantage that 
it neglects the extreme cases entirely. But, on the whole, index 
numbers are computed by the arithmetic, geometric, or harmonic 
methods or on the basis of aggregates. 

14.6. Uses of Index Numbers.—Index numbers can be used 
whenever one wishes to compare changes in groups of values from 
time to time. Their commonest uses are in the measurement of 
changes in the general price level, the cost of living, the rate of 
wages, etc. But whenever groups of values vary and we want 
some single summary figure with which to express the variation, 
the use of index numbers is indicated. Suppose, for example, 
that we want to trace changes in the sanitary conditions in a 
given city. We decide that the healthfulness of the city can be 
determined in part by the infant death rate, in part by the 
percentage of dwellings having modern plumbing, in part by 
the number of absences from the public schools, etc. We 
determine first which things to include. We must then deter¬ 
mine the proper relative weights. The computation of the 
index number is then simple. 

To illustrate, suppose that we are to measure the healthfulness 
of a city by the three items mentioned above. As the conditions 
become more healthful, the infant death rate will presumably 
fall; but we want values which increase with the healthfulness. 
Let us take, therefore, the difference between the infant death 
rate and, say, 300; that is, we shall subtract each infant death 
rate from 300. Likewise it is to be expected that the number of 
absences from school would decrease as the healthfulness of the 
city increased. There would also be variations in the number of 
absences according to the number of pupils registered in the 
schools from year to year. Thus our measure might well be 
the average number of days attended per pupil divided by the 
number of days that school was in session. If there were 
700 pupils in the town and they attended an average of 200 days 
each, and if the schools were in session 203 days during the 
year, our measure would be $\frac{200}{203} = 98.5$. The attendance 
would be, in other words, 98.5 per cent of the total possible 
attendance. 

Now let us take a hypothetical case. In a given city we have 
the figures on infant mortality, on school attendance, and on 
plumbing, for two successive years. Let us call the years 1920 
and 1921. In 1920 the infant death rate was 105.2, 73 per cent 
of the houses had modern plumbing, and the school attendance 
was 96 per cent of maximum. In the second year the infant 
death rate was 103.4, the school attendance was 95.5 per cent 
of maximum, and 75 per cent of the houses had modern plumbing. 
If we are not to weight our figures, the “index of healthfulness” 
will be computed as in Table 14.5. 

Table 14.5.—Computation of “Index of Healthfulness” 

Measure of Healthfulness    Value    Relative Value 
1920 
Infant mortality            194.8        100 
Plumbing                     73.0        100 
School attendance            96.0        100 
1921 
Infant mortality            196.6        100.9 
Plumbing                     75.0        102.7 
School attendance            95.5         99.5 

The arithmetic mean of the three relatives for 1921 is 101.0. We 
could say, then, that the health index of the city rose from 100 
to 101 during the year. 1 

Suppose, however, we decide that a change in the infant death 
rate is ten times as important as school attendance as a health 
indicator, and that the plumbing conditions are twice as impor¬ 
tant as school attendance. This assumption would give us 
weights of 10 for the infant death rate, 2 for plumbing, and 1 
for school attendance. If we weight the relatives on this basis, 
we have the figures shown in Table 14.6. The health index for 
1921 is, then, 1313.9/13 = 101.1. 


Table 14.6.—Computation of Weighted “Index of Healthfulness” 

Measure of Healthfulness    Relative Value    Weight    Product 
1921 
Infant mortality                100.9           10       1009.0 
Plumbing                        102.7            2        205.4 
School attendance                99.5            1         99.5 
Totals                                          13       1313.9 


1 The figures for infant mortality in the table are 194.8 and 196.6. These 
are obtained by subtracting the actual rates of 105.2 and 103.4 from 300, as 
explained in the text. 

It should be obvious that such an index might be useful in 
comparing different cities at a given time as well as different 
times in the same city. Thus indices may be geographical as 
well as chronological. 1 
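
The hypothetical “index of healthfulness” can be sketched in code to show that the method is the same as for prices. This is an illustration only, assuming Python; all names are hypothetical, and the measures, the subtraction from 300, and the weights of 10, 2, and 1 are the arbitrary ones adopted in the text.

```python
# Measures that rise with healthfulness; infant death rates are subtracted from 300.
measures_1920 = {"infant mortality": 300 - 105.2, "plumbing": 73.0,
                 "school attendance": 96.0}
measures_1921 = {"infant mortality": 300 - 103.4, "plumbing": 75.0,
                 "school attendance": 95.5}
weights = {"infant mortality": 10, "plumbing": 2, "school attendance": 1}

# Relative of each measure, with 1920 as the base (1920 = 100).
relatives_1921 = {name: 100.0 * measures_1921[name] / measures_1920[name]
                  for name in measures_1920}

# Simple and weighted arithmetic means of the relatives.
unweighted_index = sum(relatives_1921.values()) / len(relatives_1921)
weighted_index = (sum(relatives_1921[n] * weights[n] for n in weights)
                  / sum(weights.values()))

print(f"unweighted index, 1921: {unweighted_index:.1f}")
print(f"weighted index, 1921:   {weighted_index:.1f}")
```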

14.7. Correcting Prices with Index Numbers. —The farm price 
of potatoes in the United States on Dec. 1 of various years is 
given in Table 14.7. 2 In the same table are given the index numbers 
of the wholesale prices of “all commodities” for these years. 3 

Table 14.7.—Farm Price of Potatoes in the United States on 
December 1, and Index Number of Wholesale Prices of 
All Commodities, 1916-1925 

        Farm Price of Potatoes    All-commodity Price Index 
Year    (cents per bushel)        (1910-1914 = 100) 
1916         146                       125 
1917         123                       172 
1918         119                       191 
1919         158                       202 
1920         113                       226 
1921         108                       143 
1922          56                       141 
1923          76                       147 
1924          62                       143 
1925         187                       151 


This index number is based on the prices of a large number of 
commodities (well over 800 at present) and is here given with 
the average of the years 1910-1914 as a base. Thus the index 
number for 1919 means that wholesale prices were, in general, 

1 This example of an "index of healthfulness" is to be considered by the 
student as something quite as hypothetical as the price index in which we 
worked with commodities A , B, C, D, and E. It may well be that the factors 
here listed are not important as measures of healthfulness, and it must be 
that their relative importance is far from that here given. Also the subtrac¬ 
tion from 300, etc., is quite arbitrary. The author apologizes to vital 
statisticians for intruding upon a field about which he is ignorant, but it 
seems worth while to point out to the student that there are uses for the 
methods here discussed other than uses represented in the analysis of prices. 

2 From Statistical Abstract, 1933, p. 597. 

3 From Farm Economics, Cornell University, September, 1931, pp. 1586-1587. 

202 per cent of their average 1910-1914 level. They had risen 
to over double their average value for the base period. 

It will be noted from the table that the farm price of potatoes 
was much lower in the years 1922-1924 than it had been before. 
But it is also well known that this was a time when prices were 
falling generally in the deflation after the war. The index num¬ 
bers show that prices reached their postwar peak in 1920 and 
fell sharply during the following year. 

There are at least two things that can account for a change 
in the price of potatoes. In the first place, the potatoes may 
have relatively more or less value in terms of all other com¬ 
modities on account of the size of the potato crop or of changes 
in people’s desires for potatoes. In the second place, there may 
have been changes in the value of money itself. We commonly 
measure the value of other things in money terms, but, as has 
been often pointed out by economists, money is a poor measuring 
stick because it is not constant in value. During and immedi¬ 
ately after the war almost all prices rose. It took more money to 
buy the same goods; the value of money had fallen. Then came 
the break in prices, and money suddenly became more valuable 
again. 

Now we are interested in knowing whether these changes in the 
price of potatoes were due merely to fluctuations in the value of 
money or whether they show some change in the economic posi¬ 
tion of the potatoes themselves. Our price index tells us (in so far 
as it is an accurate measure of changes in the purchasing power 
of money) that it took $1.25 in 1916 to buy what $1 had pur¬ 
chased during 1910-1914. By 1919 it took $2.02 to buy this 
same bill of goods; in other words, money had become much 
less valuable during the period. If the value of potatoes meas¬ 
ured in terms of commodities other than money had remained 
constant throughout this period, the price of potatoes would have 
risen because the value of money had fallen. During this period 
the price of potatoes did rise from $1.46 per bushel to $1.58 per 
bushel. Can this change be accounted for entirely by the change 
in the value of money? We discover the answer by finding the 
corrected prices of potatoes. 

If the dollar had the same purchasing power in 1916 that it 
had in the base period, then it would still have taken $1 to 
buy the goods that were really selling for $1.25 in 1916. In 
other words, prices would have been reduced from 125 to 100. 
We can do this by dividing prices for 1916 by 1.25. Similarly, 
if we divide the prices for 1917 (when the index number was 172) 
by 1.72, we shall be putting the prices on a basis of dollars with 
the purchasing power that dollars had in the base period. And if 
we divide each price by the index number for the same year 
(remembering that an index number is a percentage, and there¬ 
fore pointing off two places when dividing), we shall be stating 
the price for each year as it would have been had the dollar 
retained a constant purchasing power equal to that which it 
had in the base period. Since the base period was the average 
of the years 1910-1914, we could call these prices “prices in 
1910-1914 dollars” to show that we are talking about prices 
measured in a dollar which supposedly has constant purchasing 
power. It is also common to call such prices corrected prices to 
indicate that they have been corrected to account for changes in 
the value of money. A corrected price, then, is a price which has 
been divided by the index number for the year (or other period 
that the price may represent). 

Since these are Dec. 1 prices of potatoes, it would be better 
for us to correct them with the index numbers for Dec. 1 of the 
years given. But, since it is our purpose merely to illustrate the 
process of correction, we shall not bother to undertake a refined 
analysis of the data. 

We can, then, “correct” the potato prices from Table 14.7, 
page 431, by dividing each price by the index number. This 
would give us the corrected prices. These appear in Table 14.8, 
and the original prices are given with them for purposes of com¬ 
parison. These corrected figures are intended to show the rela¬ 
tive purchasing power of potatoes in terms of other commodities, 
not merely in terms of money. They show that, although the 
price of potatoes fell somewhat between 1920 and 1921, other 
things in general fell more, so that the same quantity of potatoes 
would buy relatively more of other things. In many economic 
problems these corrected prices are far more important than the 
actual prices. Although no index number has ever been devised 
which measures changes in the general price level with absolute 
accuracy and to the satisfaction of everyone, nevertheless correc¬ 
tion of prices by what index numbers we have is certainly much 
better than no correction at all. 
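
The correction of prices can be sketched as follows. This is a minimal illustration, assuming Python; the names `potato_cents`, `price_index`, and `corrected` are hypothetical. The figures are the Dec. 1 potato prices and the all-commodity index of Table 14.7. The corrected values agree with those tabulated in the text, apart from an occasional difference of one unit in the last digit from rounding.

```python
# Farm price of potatoes (cents per bushel) and the all-commodity
# wholesale price index (1910-1914 = 100), by year.
potato_cents = {1916: 146, 1917: 123, 1918: 119, 1919: 158, 1920: 113,
                1921: 108, 1922: 56, 1923: 76, 1924: 62, 1925: 187}
price_index = {1916: 125, 1917: 172, 1918: 191, 1919: 202, 1920: 226,
               1921: 143, 1922: 141, 1923: 147, 1924: 143, 1925: 151}

# A corrected price is the actual price divided by the index number,
# the index being a percentage (hence the division by 100).
corrected = {year: potato_cents[year] / (price_index[year] / 100.0)
             for year in potato_cents}

for year in sorted(corrected):
    print(f"{year}  actual {potato_cents[year]:4d}  "
          f"in 1910-1914 dollars {corrected[year]:6.1f}")
```

Between 1920 and 1921 the actual price falls while the corrected price rises, which is the comparison drawn in the text.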

Table 14.8.—The Correction of Potato Prices 

          Farm Price of Potatoes (cents per bushel) 
Dec. 1    Actual    In 1910-1914 Dollars 
1916       146         117 
1917       123          71.5 
1918       119          62.4 
1919       158          78.3 
1920       113          50.0 
1921       108          75.6 
1922        56          39.7 
1923        76          51.7 
1924        62          43.4 
1925       187         124 

14.8. The Choice of a Base Period for Index Numbers.—We 
have discovered that the base period is the period with which 
comparisons are made—the period which is taken as 100 per cent and 
from which all the other index numbers are computed. One 
can, of course, select for a base period any period he wishes. We 
could base price indices on the prices which were being paid at 
10 a.m. on Jan. 9, 1935. Usually the base period represents the 
average values for a period of time, such as a year. Surely, if 
we are going to compare all of our index numbers with that of the 
base period, it is important to select a base period which is 
“representative”—which is “normal” in some way. One may well 
say that no period is normal, but at least one can understand what 
is meant when it is said that the prices of 1952 were “abnormal.” 
If we were computing a price index, probably we should not 
select a year such as 1952 for the base. 

In addition to choosing a base period which is “normal,” 
there are some advantages in choosing a base period which is not 
too distant. If we are interested primarily in present-day prices, 
there are some disadvantages in comparing always with the 
prices of 1890 or 1913. Here the base period is so far removed 
that there is no reason at all for considering the prices of that 
time as “normal” prices for the present. Moreover, even with¬ 
out any general movement of prices up or down, the scatter 
which would occur in the original prices by pure chance variation 
would ultimately become so great that the type bias would make 
itself felt strongly. Hence one takes, usually, as a base period a 
“normal” period of the recent past. 

In addition there are advantages in taking as a base period 
some period which is commonly used by others who are com¬ 
puting index numbers, so that your results may be compared 
easily with theirs. Commonly used base periods are the average 
of the years 1910-1914, still used for many agricultural statistics, 
the year 1926, and the average of the years 1935-1939. The 
last is perhaps in most common use at present. 

14.9. Link Relatives and Chain Indices.—Writers sometimes 
use a moving base for their index numbers, hoping to increase the 
accuracy of the index numbers for year-to-year comparisons. 
Although it has been shown that such index numbers give no 
increase in accuracy, but rather the reverse, 1 they do seem to 
offer some advantage when it becomes necessary to change the 
list of commodities included in the index, or to alter their weights. 
Under such circumstances the index for each year (or other time 
period) is computed with the preceding year as a base, using 
whatever price and weight data are available. Similarly the 
index of the third year is computed with the data of the second 
year as a base, the index of the fourth year with the data of the 
third year as a base, etc. The index for each year is thus given 
with the preceding year as 100. Our results might look like 
those of column 2 in the accompanying table. Our first year is 
taken as 100, since we have no preceding year with which to 
compare it. These index numbers, computed on a moving base, 
are called link index numbers. Price relatives so computed 
would be called link relatives. 

1 Allyn A. Young, Index Numbers, in “Handbook of Mathematical 
Statistics,” pp. 183-184, H. L. Rietz, ed., Houghton Mifflin Company, 
Boston, 1924. 

        Link Index                    Chain Index    Chain Index 
Year    Number        Multiply by     Number         (1935 = 100) 
1930     100                            100.0           103.6 
1931      92            100              92.0            95.3 
1932     105             92              96.6           100.1 
1933     102             96.6            98.5           102.1 
1934     100             98.5            98.5           102.1 
1935      98             98.5            96.5           100.0 
1936      96             96.5            92.6            96.0 
1937     101             92.6            93.5            96.9 
1938     105             93.5            98.2           101.8 
1939     104             98.2           102.1           105.8 

These index numbers are related to no common base, but 
usually we wish so to relate them. We do this by “chaining” 
them together in what is called a chain index. It will be noted 
in the table that the link index for 1931 is 92; this means 92 per 
cent of the preceding year. Since the preceding year is 100, we 
have 92 × 100 = 92 (these figures all being percentages, and 
hence having decimal points before the last two digits). In 
1932 the index was 105 per cent of the preceding year, or 105 per 
cent of 92. This gives us 96.6 for the 1932 index. The index 
for 1933 is 102 per cent of the 1932 index, or 102 per cent of 96.6. 
This gives us 98.5 for the 1933 index. Similarly we chain together 
the indices of the other years, multiplying each link index num¬ 
ber by the chain index number of the preceding year. Our 
resulting chain index is based on 1930. If we wish to use any 
year other than this first year as a base, we can easily convert, as 
has been done in the table. For example, if we prefer a 1935 
base, we can divide each chain index on the 1930 base by 96.5 
(the 1935 chain index on the original base). Thus we have the 
figures in the last column of the table above. 

It is evident that this method of computation can be used even 
though radical changes are made in the commodities covered by 
the index. It is necessary only that we compute our original 
links on comparable sets of data; that is, when we compute the 
link for 1932 it is necessary that we use comparable data for 1931 
and 1932. Likewise, when we compute the link for 1933 it is 
necessary that our data for 1932 and 1933 be comparable. But 
it is not necessary that the data for 1932 be the same in both 
these links. 

We can summarize as follows the method of computing a chain 
index: 

1. Calculate the index for each period with the data of the preceding 
period as a base. 

2. Chain the links together to form a chain index with the first period as 
100. This is done by calling the chain index for the first period 100 and then 
multiplying each link by the chain index of the preceding period. 

3. Shift the chain index to the desired base by dividing each chain index 
number just found by the chain index number for the desired base period. 
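
The three steps just summarized can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `chain_index` and `rebase` are invented for the purpose, and the link figures are those of the table above. The sketch carries full precision from year to year, whereas the table rounds each chain figure to one decimal before using it again, so the later years may differ from the printed values in the first decimal place.

```python
def chain_index(links):
    """Chain period-to-period link indices (per cent) into one series.

    The first period is called 100; each later value is the link for that
    period multiplied by the chain index of the preceding period.
    """
    chain = [100.0]
    for link in links[1:]:
        chain.append(chain[-1] * link / 100.0)
    return chain


def rebase(series, base_position):
    """Shift a series so that the period at base_position equals 100."""
    base = series[base_position]
    return [100.0 * value / base for value in series]


years = list(range(1930, 1940))
links = [100, 92, 105, 102, 100, 98, 96, 101, 105, 104]  # link indices from the table

chain_1930 = chain_index(links)                        # chain index, 1930 = 100
chain_1935 = rebase(chain_1930, years.index(1935))     # same series, 1935 = 100

for year, on_1930, on_1935 in zip(years, chain_1930, chain_1935):
    print(year, round(on_1930, 1), round(on_1935, 1))
```

Because each link needs only the data of two adjacent periods, the list of links could be built from differently constituted commodity lists in different years without any change to the chaining step.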

14.10. Choosing a Formula for Index Numbers.—The problem 
of computing index numbers is the problem of describing a uni¬ 
verse from a sample. For example, if we wish to compute an 
index number to show changes in the level of the prices of farm 
products, we cannot ordinarily include data on all agricultural 
transactions. We shall be forced to base our index number on 
the prices of only a part of the possible commodities, and even in 
the cases of the commodities included we shall have price quota¬ 
tions on only a few of the actual transactions. But we hope that 
the price movements registered by the commodities and trans¬ 
actions chosen will be typical of the movements of all farm prices. 

We have already seen that some methods of computing index 
numbers will introduce bias into our results—that the method of 
computation itself will make the index number increase or 
decrease in size even though the general level of the prices them¬ 
selves has not changed. Mere increases or decreases in the 
dispersion of these prices will affect the size of the index number. 

In order to eliminate or minimize the various types of bias that 
may arise, we find that many more or less complicated refine¬ 
ments have been introduced in index-number computation or 
suggested in index-number theory. Some of the suggestions are 
in themselves so intricate and time-consuming that they are never 
applied in practice. The workaday statistician is likely to forego 
the time and labor involved unless the size of the correction is 
considerable. 

Professor Irving Fisher has made a careful study of various pro¬ 
posals for computing index numbers 1 and has suggested various 
tests to be applied to any formula to indicate whether or not it 
is satisfactory. The two most important of these he calls the 
time-reversal test and the factor-reversal test. If an index number 
is to meet the time-reversal test it must be so computed that the 
index number for any year X to the base year Y is the reciprocal 
of the index number for the year Y to the base year X. An 
index number meets the factor-reversal test if a price index and 
a quantity index computed from the same data will yield, when 

1 Irving Fisher, “The Making of Index Numbers,” Houghton Mifflin 
Company, Boston, 1922. 
multiplied together, the value index derived from the same data. 
This test likewise requires that the prices and quantities may be 
interchanged without invalidating the test. 

In attempting to meet these tests and others which have been 
suggested, various statisticians have suggested a multitude of 
formulas. Most of these are much more complicated than the 
methods we have illustrated in this elementary text. We can, 
however, illustrate their complexity and their general nature 
best by giving one or two of the formulas which seem best adapted 
in theory to meet the tests. 

If we let the price of a commodity in the base year be represented 
by p_0, while the price in any other given year is p_1; and if 
we let the weight in the base year be q_0 and in the given year q_1, 
Fisher concludes that the “ideal” index number would be found 
by means of this formula: 

\[ \text{Index} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \]



Another formula, the results of which are in practice almost 
identical with those of the “ideal” formula, is the aggregative 
formula of Marshall and Edgeworth, which follows: 


\[ \text{Index} = \frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0} \]


The computations involved with this formula are much simpler 
and shorter than those with the ideal formula, and the slight 
difference in results would seldom make the application of the 
“ideal” formula worth while. 
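
The following sketch compares the two formulas and tries Fisher's two tests on the “ideal” formula. It rests on several assumptions: the prices and quantities are invented purely for illustration, the subscript 1 is used here for the given year, the function names are hypothetical, and the index is left as a ratio rather than multiplied by 100.

```python
from math import sqrt


def sum_products(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def fisher_ideal(p0, p1, q0, q1):
    """Geometric mean of the base-weighted and given-year-weighted aggregates."""
    base_weighted = sum_products(p1, q0) / sum_products(p0, q0)
    given_weighted = sum_products(p1, q1) / sum_products(p0, q1)
    return sqrt(base_weighted * given_weighted)


def marshall_edgeworth(p0, p1, q0, q1):
    """Aggregative index weighted by the sum of base and given-year quantities."""
    numerator = sum((a + b) * p for a, b, p in zip(q0, q1, p1))
    denominator = sum((a + b) * p for a, b, p in zip(q0, q1, p0))
    return numerator / denominator


# Hypothetical data for three commodities (not taken from the text).
p0 = [1.00, 0.50, 0.80]   # base-year prices
p1 = [1.20, 0.45, 0.70]   # given-year prices
q0 = [10, 40, 25]         # base-year quantities
q1 = [8, 50, 30]          # given-year quantities

ideal = fisher_ideal(p0, p1, q0, q1)
aggregative = marshall_edgeworth(p0, p1, q0, q1)

# Time-reversal test: forward index times backward index should equal 1.
backward = fisher_ideal(p1, p0, q1, q0)

# Factor-reversal test: price index times quantity index should equal the
# ratio of total values in the two years.
quantity_index = fisher_ideal(q0, q1, p0, p1)
value_ratio = sum_products(p1, q1) / sum_products(p0, q0)

print(round(ideal, 4), round(aggregative, 4))
print(round(ideal * backward, 6), round(ideal * quantity_index, 6), round(value_ratio, 6))
```

With these invented figures the two indices agree to about three decimal places, which is the sense in which the shorter aggregative computation is usually sufficient.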

It is evident that the simple methods discussed earlier in this 
chapter can all be described by means of formulas. For example, 
the simple arithmetic mean of relatives is found thus: 


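\[ \text{Index} = \frac{\sum (p_1 / p_0)}{N} \]

Here N is the number of commodities included. The simple aggregative index number would be 

\[ \text{Index} = \frac{\sum p_1}{\sum p_0} \]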
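
As a small illustration of the two simple formulas, the sketch below computes both from the same prices. The prices are hypothetical, the function names are invented for the example, and the given year is denoted by the subscript 1 as an assumption; the results are ratios, not percentages.

```python
def mean_of_relatives(p0, p1):
    """Simple arithmetic mean of the price relatives p1/p0."""
    relatives = [given / base for base, given in zip(p0, p1)]
    return sum(relatives) / len(relatives)


def simple_aggregative(p0, p1):
    """Sum of given-year prices divided by sum of base-year prices."""
    return sum(p1) / sum(p0)


# Hypothetical prices for three commodities (not taken from the text).
p0 = [1.00, 0.50, 0.80]
p1 = [1.20, 0.45, 0.70]

relatives_index = mean_of_relatives(p0, p1)
aggregative_index = simple_aggregative(p0, p1)

print(round(relatives_index, 4), round(aggregative_index, 4))
```

The two results need not agree, and with these figures they fall on opposite sides of unity: the aggregative form gives most influence to the commodity with the largest price, while the mean of relatives treats every commodity alike.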

14.11. Selection of Basic Data.—It will be recalled 1 that the 
reliability of an arithmetic mean varies, not in proportion to the 
number of cases on which it is based, but in proportion to the 
square root of that number. Since an index number is a sort of 
average, showing the typical movement among a whole class of 
cases, we should remember that its reliability, too, can be 
expected to vary roughly with the square root of the number of 
items on which the index is based if these items are of equal 
importance. Usually, however, it is possible to pick out some 
items that are far more important than others, and if we choose 
them first and give them proper weight, we soon come to a point 
where the remaining items are unimportant enough and the cost 
of collecting them great enough so that we are warranted in neg¬ 
lecting them. While some indices in actual practical use are 
based on data concerning several hundred commodities, 2 if the 
items selected are judiciously chosen it should be possible to com¬ 
pute a worth-while index on a relatively small number. 

Troublesome always in selection of basic data is the problem of 
getting quotations that are comparable. Even with simple 
staples, it is difficult to compare the packaged butter or flour of 
today with the bulk commodities of a half-century ago. The 
problem becomes more difficult still when we are forced to deal 
with nonstandardized things such as women’s dresses or enter¬ 
tainment, either one of which might well be included in an index 
of the cost of living. And when we come to commodities which 
may now be important but which formerly were not used at all, 
such as radios, the problem becomes very difficult indeed. No 
general rules can be laid down for such cases, and again it is neces¬ 
sary to emphasize the fact that good statistical work is largely 
nonmathematical in character, involving the use of good judgment 
by someone who is thoroughly familiar with the facts in 
his own field as well as with the technical statistical procedures. 

14.12. Suggestions for Further Reading.—The student who wishes to go 
further in the field of index numbers will find Irving Fisher’s “The Making 
of Index Numbers,” Houghton Mifflin Company, Boston, 1922, required 
reading. It is voluminous, but it is simply and understandably written, 

1 See Sec. 9.2, p. 240. 

2 Perhaps the most useful of all indices published is the index number of 
the wholesale prices of “all” commodities, published by the U.S. Bureau of 
Labor Statistics. This index is based on over 800 commodities. 
and is probably the outstanding book in this field. For a short but very 
satisfactory treatment, see William L. Crum and Alson C. Patton, “An 
Introduction to the Methods of Economic Statistics,” Chaps. XVIII and 
XIX, McGraw-Hill Book Company, Inc., New York, 1925. Willford I. 
King, “Index Numbers Elucidated,” Longmans, Green & Co., Inc., New 
York, 1930, is a valuable small book on the subject. King does not agree 
with Fisher’s ideas on the subject of an “ideal” index number, concluding 
that the formula which is “best” depends on the problem at hand rather 
than on mathematical considerations. A useful description of a few of the 
leading current index numbers, followed by a very brief description of 45 such 
current published indices, may be found in Frederick E. Croxton and Dudley 
J. Cowden, “Practical Business Statistics,” Chap. XVIII, Prentice-Hall, 
Inc., New York, 1934. Allyn A. Young’s chapter on Index Numbers which 
appears as Chap. XII of the “Handbook of Mathematical Statistics,” 
edited by Henry L. Rietz, Houghton Mifflin Company, Boston, 1924, is very 
short and very good. 

EXERCISES 

1. Check the computation of the indices given on page 423, showing your 
methods. 

2. How would you go about the process of computing the weighted harmonic 
mean of the relatives from which we computed the weighted geometric 
mean on page 426? 

3. On pages 429-431 is described the computation of an “index of healthfulness.” 
Suppose that in 1922 the infant mortality rate was 95, the school 
attendance was 99 per cent of maximum, and 80 per cent of the dwellings 
enjoyed modern plumbing. Compute the weighted average index of healthfulness 
for 1922 to compare with that computed in the text for 1921 (see 
page 430). 

4. Using the fictitious commodities of the example that begins on page 421, 
suppose that the prices of our five commodities in 1936 are 

A    $1.20 
B     0.45 
C     0.70 
D     0.95 
E     1.00 

Compute the weighted arithmetic average index number for 1936 to compare 
with that computed for 1935 on page 426. Compute the weighted geo¬ 
metric average index number for 1936 to compare with that for 1935 com¬ 
puted on page 427. 

5. In 1935 the index number of the wholesale price of all commodities 
was 117, while the index number of the cost of living in the United States 
was 140. Both numbers are from publications of the U.S. Department of 
Labor, and both are on a basis of the average figures for the years 1910 
through 1914 (that is, 1910-1914 = 100). Compare the two figures, and 
comment. 

6. We hear a good deal about the virtues of a “random sample.” If you 
were selecting items to be included in the computation of an index number, 
would you select items at random, or would you select them according to a 
plan? In either case, why? 

7. To illustrate how index numbers can be used in fields other than 
economics and business, outline the items that you would include in an 
index number to be used for comparing the scholastic standings of various 
colleges and universities. Select things that can be numerically expressed. 

8. Give what you think are approximately correct weights to the items 
that you have enumerated in the preceding exercise. 

9. When comparing nontemporal phenomena, as in Exercise 7 above, 
there is no “base period.” How would you select the base for your index 
number in such a case? 

10. Table 14.9 gives the United States production of petroleum, Pennsyl¬ 
vania anthracite coal, bituminous coal, and coke for the years 1930 through 
1939. The value of the outputs of the four fuels in millions of dollars in 
1939 was 1 


Petroleum      $1265 
Anthracite       187 
Bituminous       733 
Coke             213 


Compute a weighted index number of the volume of fuel production in the 
United States for each of these 10 years, using the values as weights and 
using an arithmetic average of the relatives. Use 1935 as a base. 


Table 14.9.—United States Output of Certain Fuels, 1930-1939 

         Petroleum      Anthracite     Bituminous       Coke 
Year     (millions      (millions      (millions      (millions 
         of barrels)     of tons)       of tons)       of tons) 

1930        898            69.4           468            48.0 
1931        851            59.6           382            33.5 
1932        785            49.9           310            21.8 
1933        906            49.5           334            27.6 
1934        908            57.2           359            31.8 
1935        997            52.2           372            35.1 
1936       1100            54.6           439            46.3 
1937       1279            51.9           446            52.4 
1938       1214            46.1           349            32.5 
1939       1264            51.5           393            44.3 


All data for this exercise from “World Almanac,” 1941, pp. 601-602. 



CHAPTER XV 


SIMPLE CORRELATION 

In Chap. XI we discussed the nature of relationship and 
described various simple ways of determining whether or not two 
variables are related. We have become familiar with the use of 
group averages and of scatter diagrams, and with the fitting of 
straight lines and curves to scatter diagrams in order to get some 
way of estimating values of the dependent variable. We shall 
now push these matters a little farther to discover and interpret 
certain values closely related to those which we discovered 
earlier. 

15.1. Errors of Estimate. —Figures 11.1 and 11.2 on pages 300 
and 301 both show scatter diagrams giving a picture of the rela¬ 
tionship between age and blood pressure. In Fig. 11.1 the points 
are rather widely scattered, and if we fit any straight line or 
simple curve to the diagram, many of the points will lie far from 
the line. In Fig. 11.2, in contrast, the points lie in a very narrow 
band, and it is easy to fit a straight line which will fall very 
close to every one of them. If we use our lines as a basis for 
estimating blood pressures, many of the actual blood pressures 
of Fig. 11.1 will differ a good deal from our estimates. We shall 
have rather large errors of estimate. In Fig. 11.2, however, the 
differences between our estimates and the actual values will be 
small. We shall have small errors of estimate. 

The dots on a scatter diagram show the actual values of Y for 
various values of X ; but the straight line or curve which we com¬ 
pute and add to the diagram is our means of estimating values 
of Y for various values of X. Our estimates all lie exactly on 
the line, but the dots do not. They are more or less widely 
scattered around the line. If the band of dots is a narrow and 
compact one, as in Fig. 11.2, our estimates will all be close to the 
points, and our errors will be small. If the band of dots is a wide 
and loose one, as in Fig. 11.1, many of our estimates will contain 
large errors. Figure 15.1 repeats the dots of Fig. 11.1, and shows 
the linear regression line fitted by the method of least squares 
in Sec. 11.9. The points do not fall on the line, and the vertical 
distance of each dot from the line has been indicated. These 
vertical lines, joining the dots to the regression line, represent 
the errors of estimate. If we are asked to estimate the blood 
pressure of a person aged 78 years, we use our regression formula 
and estimate a blood pressure of 140.4 mm. But when we look 
back at Table 11.7 on page 313 we find that actually the only 
person of that age in the table had a blood pressure of 160. Our 
estimate of his blood pressure was in error by 19.6 mm. 
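
The single estimate just described can be checked with a few lines of Python. The sketch assumes the rounded regression equation Y = 117 + 0.3X given in the text; the function name `estimate_pressure` is invented for the illustration.

```python
def estimate_pressure(age, a=117.0, b=0.3):
    """Estimate blood pressure (mm) from age with the line Y = a + bX."""
    return a + b * age


estimate = estimate_pressure(78)   # estimate for a person aged 78
actual = 160                       # the observed pressure at that age
error = estimate - actual          # estimated minus actual

print(round(estimate, 1), round(error, 1))
```

The error is negative here because the line underestimates this person's blood pressure; only its size, 19.6 mm, matters for the discussion.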



Fig. 15.1. Showing errors in estimating blood pressure from age by use of 
least-square linear regression line. 

When we are studying relationships we want not only some 
formula for estimating values of the dependent variable, but also 
some way of knowing how accurate those estimates are likely 
to be. If we were trying to guess blood pressure without any 
knowledge of age we would not, of course, pick a number at 
random. We would not, for example, guess a blood pressure of 
300 nor one of 17. In the absence of any other information we 
would be wise to guess an average blood pressure of 127 mm. 
And under such circumstances the amount of our error would 
depend on the scatter of the individual blood pressures. If in 
practice all blood pressures are fairly uniform, all concentrated 
close to the average, our guess will involve little error. But if 
the original blood pressures are widely scattered, our guesses will 
fall far afield for at least some of them. We know that in such 
cases we can get a good measure of the amount of our error by 
computing the standard deviation of the blood pressures (see 
Chap. VI). Trial will show that the standard deviation of the 
blood pressures in Table 11.7 is 11.5 mm. We recall that this 
indicates that if the values are normally distributed about two- 
thirds of the blood pressures will lie within 11.5 of the mean, 
about 95 per cent within 23.0 of the mean, and practically all 
within 34.5 of the mean. Therefore if we guess blood pressures 
at the mean value our errors will be less than 11.5 two-thirds of 
the time and practically never will they exceed 34.5. 

But can we do better than to guess at the mean? If age and 
blood pressure are related, it means that a knowledge of age is 
helpful to us in estimating blood pressure—that we can estimate 
blood pressures more accurately on the basis of age than we can 
by plain unassisted guesswork. We can see at once how much 
assistance the regression line is in estimating blood pressures by 
comparing the actual errors in Fig. 15.1 with the errors which we 
would get in pure guesswork, where, as we have just seen, we 
should be able to come within 11.5 mm. two-thirds of the time. 

Table 15.1 shows the actual blood pressures of the 21 persons 
listed in Table 11.7, and beside them shows the blood pressures 
which we would estimate by using the least-squares linear regres¬ 
sion equation Y = 117 + 0.3X. The difference between the 
actual and estimated values is the error in estimate, and these 
errors appear in the last column. If, using familiar methods, we 
compute the standard deviation of these errors, 1 we find that it 
amounts to 9.85 mm. But this is like any other standard devia¬ 
tion, and should be so interpreted. It tells us that if the errors 
are normally distributed about two-thirds of them will lie within 
9.85 of the mean, and almost never will one lie more than 29.55 
from the mean. When we used plain guesswork we could make 
estimates with an error of 11.5 mm. two-thirds of the time. 
With the regression line we can get our error down to 9.85 mm. 
two-thirds of the time. It is apparent that the regression line 
has made it possible for us to reduce our error slightly, although 
not very much. A knowledge of the ages of the people in 

1 When a line is fitted by the method of least squares, the sum of the errors 
or residuals is always zero. In Table 15.1 the sum of the last column is 0.7 
because we used rounded values. 
Table 15.1 has helped us in our estimates of their blood pressure, 
but has still left a large proportion of the error that there would 
have been had we guessed at their blood pressures, guessing a 
mean blood pressure for each of them. 
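
The comparison between guessing at the mean and using the regression line can be sketched as follows, using the actual and estimated blood pressures printed in Table 15.1. The sketch assumes the standard deviation with n as divisor (`statistics.pstdev`). Run on the rounded entries of the table, the two standard deviations need not reproduce the 11.5 and 9.85 of the text to the last digit; what matters for the argument is that the second is the smaller.

```python
from statistics import mean, pstdev

# Actual and estimated blood pressures of the 21 persons in Table 15.1.
actual = [108, 120, 128, 116, 124, 116, 130, 130, 124, 126, 130,
          140, 130, 140, 120, 126, 130, 120, 114, 140, 160]
estimated = [120.6, 120.6, 120.6, 120.9, 122.1, 122.4, 124.2, 124.5, 124.8,
             125.1, 125.4, 126.9, 127.5, 127.5, 130.2, 130.8, 133.5, 134.1,
             135.0, 135.6, 140.4]

# Error of estimate, signed as in the table: estimated minus actual.
errors = [e - a for e, a in zip(estimated, actual)]

sd_actual = pstdev(actual)   # scatter about the mean (pure guesswork)
sd_errors = pstdev(errors)   # scatter about the regression line

print("mean pressure:", round(mean(actual), 1))
print("sum of errors:", round(sum(errors), 1))
print("standard deviation of pressures:", round(sd_actual, 2))
print("standard deviation of errors:", round(sd_errors, 2))
```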

We have just computed the standard deviation of the errors 
of estimate—the standard deviation of the residuals around the 

Table 15.1.—Actual and Estimated Blood Pressures with Errors of Estimate 

                  Blood Pressure 
Name            Actual    Estimated      Error 

Ruby             108        120.6         12.6 
Betsey           120        120.6          0.6 
Muriel           128        120.6         -7.4 
Mary             116        120.9          4.9 
Alice            124        122.1         -1.9 
Frank            116        122.4          6.4 
Walter           130        124.2         -5.8 
Dorothy          130        124.5         -5.5 
Frederick        124        124.8          0.8 
Esther           126        125.1         -0.9 
Sidney           130        125.4         -4.6 
Albert           140        126.9        -13.1 
Edith            130        127.5         -2.5 
John             140        127.5        -12.5 
Robert           120        130.2         10.2 
Dan              126        130.8          4.8 
Priscilla        130        133.5          3.5 
Donald           120        134.1         14.1 
James            114        135.0         21.0 
Peter            140        135.6         -4.4 
Ralph            160        140.4        -19.6 


regression line. We know that the sum of the squared residuals 
(that is, errors) is less around this line than around any other 
straight line, so that the standard deviation of the residuals, being 
based on the sum of the squared residuals, must also be smaller 
around this line than around any other straight line. This 
standard deviation of our errors of estimate we call the standard 
error of estimate, and we represent it by the letter S. In this case 
we have been estimating values of Y from values of X, and hence 
we can call this particular standard error the standard error of 
estimating Y from X. To distinguish it from other standard 
errors of estimate, we give it subscripts, thus: S_{y.x}. This is read, 
“The standard error of estimating Y from X.” Where it will 
cause no confusion, we commonly use the symbol S_y alone. 

15.2. Correlation.—We have seen that it is helpful to compare 
the values of S_y and σ_y. In the case we have just discussed, the 
value of σ_y was 11.5 and the value of S_y was 9.85. Thus we knew 
that we could estimate by means of the regression equation as 
many of the cases within 9.85 of the actual figures as we could 
estimate within 11.5 of those figures without the regression 
equation. It is evident that if a knowledge of one's age 
were of no use to us in estimating his blood pressure, then S_y 
would be as great as σ_y. Whenever S_y is smaller than σ_y there 
has been some advantage in using the regression equation. In 
other words, a knowledge of the one variable has improved our 
estimates of the value of the other variable. But this is just 
what we said determines the existence of relationship between 
variables (see page 294). We say that variables are related when 
a knowledge of the value of one helps us in estimating the value 
of the other. And when S_y is smaller than σ_y we know that a 
knowledge of the value of one variable does help us in estimating 
the value of the other. Hence we can say that if S_y = σ_y there 
is no relationship—no correlation between the variables. But 
if S_y is smaller than σ_y correlation exists. 

This at once leads us to the conclusion that we might measure 
the degree of relationship by the difference between S_y and σ_y. 
Yet here there is an immediate difficulty. Suppose we are measuring 
the relationship of height to weight, estimating the latter 
from the former. S_y and σ_y will both be in terms of pounds, and 
the difference between them will be in pounds. Had our original 
figures been in ounces, the values of S_y and σ_y would both have 
been 16 times as great, and the difference between them would 
likewise have been 16 times as great. Yet there would be no 
more relationship between height and weight when the weights 
were stated in ounces than when they were stated in pounds. 
Obviously we need some measure of the degree of relationship 
which is independent of the units in which the problem is stated. 

This leads us to suggest another possible measure of correlation, 
namely, the ratio of S_y to σ_y. Both S_y and σ_y will always be in the 
same units, one of them being in pounds if the other is. Multiplying 
both by 16 will not affect their ratio. Hence we could use 
as a measure of the degree of correlation their ratio, or S_y/σ_y. In 
practice, statisticians have preferred to use the ratio of the 
squares of these numbers, that is, S_y²/σ_y²; in other words, they 
compare the variance of the errors of estimate with the variance 
of the original figures. 1 
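
A short sketch makes the point about units concrete. The two figures below are hypothetical values for S_y and σ_y in pounds, chosen only for illustration; converting them to ounces multiplies each by 16.

```python
# Hypothetical standard error of estimate and standard deviation, in pounds.
s_y_pounds = 12.0
sigma_y_pounds = 20.0

# The same two quantities expressed in ounces.
s_y_ounces = 16 * s_y_pounds
sigma_y_ounces = 16 * sigma_y_pounds

difference_pounds = sigma_y_pounds - s_y_pounds
difference_ounces = sigma_y_ounces - s_y_ounces

ratio_pounds = s_y_pounds / sigma_y_pounds
ratio_ounces = s_y_ounces / sigma_y_ounces

variance_ratio_pounds = s_y_pounds ** 2 / sigma_y_pounds ** 2
variance_ratio_ounces = s_y_ounces ** 2 / sigma_y_ounces ** 2

print(difference_pounds, difference_ounces)            # the difference depends on the unit
print(ratio_pounds, ratio_ounces)                      # the ratio does not
print(variance_ratio_pounds, variance_ratio_ounces)    # nor does the ratio of squares
```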

But one minor difficulty still exists with this measure. The 
greater the relationship between the variables, the smaller will be 
S_y as compared with σ_y. If we could estimate with perfect accuracy, 
making no error in any case, it would mean that all the 
points on our scatter diagram fell on the regression line. There 
would be no errors of estimate, and S_y would equal 0. On the 
other hand, if there were absolutely no relationship between our 
two variables, S_y would be as great as σ_y. These are the limiting 
cases. Perfect correlation would mean that S_y²/σ_y² was equal to 
0, and the entire lack of correlation would mean that S_y²/σ_y² was 
equal to 1. But it is preferable to have a measure which shows 
the amount of the relationship directly by its size. We should 
like to have close relationships expressed by large coefficients and 
lack of relationship expressed by small coefficients. This can be 
done by subtracting our ratio from 1, to get 1 − (S_y²/σ_y²). And 
we usually take the square root of this to put it back into the first 
degree, since S_y and σ_y were originally squared. In this case our 
measure of correlation will be defined by the formula 

\[ r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} \]

The letter r always represents the coefficient of correlation around 
a straight line fitted by the method of least squares. In other 
words, if we are making our estimates from a line of the form 
Y = a + bX, and if this line was fitted by the method of least 
squares, then r is defined as indicated and is known as the 
coefficient of simple linear correlation. 2 People often speak of it as 
the coefficient of correlation, without bothering with the rest of 

1 We recall that the square of the standard deviation is called the variance. 
Here, as in Chap. X, we are comparing two variances. 

2 This coefficient is also commonly known as the Pearsonian coefficient of 
correlation after the great English statistician, Karl Pearson, who did much 
of the early work in the field of correlation. 
the title. It is true that there are many other types of coefficients 
of correlation, and it is safer to specify the one in question. If 
one uses the letter r, however, it is always understood that this 
is the coefficient to which reference is made. 

We have already computed all the values necessary to find the 
value of r. If we substitute them in the foregoing formula, we 
find that 

\[ r = \sqrt{1 - \frac{9.85^2}{11.5^2}} = \pm 0.515 \]

Since S_y can never be larger than σ_y, it is evident that the ratio 
S_y²/σ_y² can never be greater than unity. Hence the value of 
unity minus this ratio can never be negative, and the square root 
can always be taken. As when any other square root is taken, 
the result may be either positive or negative; that is, √4 = ±2. 
Here, then, we can call our result +0.515 or −0.515. It is the 
accepted practice of statisticians to give to r the same sign that is 
found for b in the regression equation. In this case we found that 
b = +0.3 (see page 313). Hence we give r a plus sign and 

r = +0.515 

The sign means that the relationship is direct or positive as these 
terms were defined on page 294. 
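
The computation of r from S_y and σ_y, with the sign taken from b, can be sketched as below. The function name `correlation_coefficient` is invented for the illustration, and the guard against a negative quantity under the root is an added precaution against rounding, not part of the definition. With the rounded figures 9.85 and 11.5 the result agrees with the 0.515 of the text to within rounding.

```python
from math import copysign, sqrt


def correlation_coefficient(s_y, sigma_y, b):
    """Return r = sqrt(1 - S_y^2 / sigma_y^2), signed like the slope b."""
    magnitude = sqrt(max(0.0, 1.0 - s_y ** 2 / sigma_y ** 2))
    return copysign(magnitude, b)


r = correlation_coefficient(s_y=9.85, sigma_y=11.5, b=0.3)
print(round(r, 3))

# The two limiting cases: no reduction in error, and no error at all.
print(correlation_coefficient(5.0, 5.0, 1.0))
print(correlation_coefficient(0.0, 5.0, -1.0))
```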

The smallest possible value of r would occur if there were no 
relationship at all between the variables—that is, if a knowledge 
of one helped us not at all in estimating the value of the other. 
In this case S_y would be as great as σ_y (there would be as much 
error in using the regression equation as in neglecting it), and 
hence S_y/σ_y would be equal to 1. Hence 

\[ r = \sqrt{1 - 1} = 0 \]

When there is no relationship between the variables, then, r = 0. 

The largest possible value of r would occur if the relationship 
between the variables were perfect—if the points on the scatter 
diagram all fell on the straight regression line, so that we could 
estimate each dot with entire accuracy from the regression equa¬ 
tion. In this case the errors of estimate would be nonexistent. 
S_y would equal 0, and hence S_y/σ_y would equal 0. This would 
give us 

\[ r = \sqrt{1 - 0} = \pm 1 \]

When there is a perfect linear relationship between two variables, 
r = ± 1. If large values of the one variable are associated with 
large values of the other and the relationship is perfect, r = +1. 
If large values of the one variable are associated with small values 
of the other and the relationship is perfect, r = — 1. These are 
the limiting values of the coefficient. A value of r beyond the 
range from —1 to +1 means a mistake in computation. Such a 
coefficient cannot exist in fact, on account of the nature of the 
definition. 

The student should be warned at this point that a value for r of 
0 does not mean that no relationship exists. We have said that 
if no relationship exists the value of r will be 0, but the statement 
cannot be turned around and still retain its validity. The 
coefficient of simple linear correlation measures the degree to 
which a straight line describes the relationship between the varia¬ 
bles. It is quite possible for a close relationship to exist between 
two variables and to be nonlinear in nature. Attention is again 
called to the table on page 297 in which one can see a relationship 
between the variables but in which the relationship cannot be 
described by a straight line. In such a case the value of r might 
be very small, and even equal to 0. This would mean merely that 
if there were any relationship it could not be described by a 
straight line. Let us make, then, the two statements that can be 
correctly made, so that the difference between them can be noted: 

1. If there is no relationship between two variables, the value of r will be 0. 

2. If the value of r between two variables is 0, either there is no relation¬ 
ship between them or whatever relationship exists cannot be described by a 
straight line. 
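
The second statement can be illustrated with invented figures in which Y is determined exactly by X and yet r is 0. In the sketch below the data, Y = X² for values of X placed symmetrically about zero, are hypothetical, and the function name `linear_r` is an assumption of the example; the function fits the least-squares line and then applies the definition of r given earlier.

```python
from math import sqrt


def linear_r(xs, ys):
    """Coefficient of simple linear correlation from a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx                 # slope of the least-squares line
    a = mean_y - b * mean_x             # intercept
    # Variance of the errors of estimate and variance of the original values.
    var_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    r = sqrt(max(0.0, 1.0 - var_errors / var_y))
    return r if b >= 0 else -r


xs = [-3, -2, -1, 0, 1, 2, 3]
curve = [x * x for x in xs]        # exact but nonlinear relationship
line = [2 * x + 1 for x in xs]     # exact linear relationship

r_curve = linear_r(xs, curve)
r_line = linear_r(xs, line)
print(r_curve, r_line)
```

Every value of the curved series can be estimated without error from X, yet the best straight line is horizontal and reduces the error not at all.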

Failure to realize the qualification in the second statement has 
led many a careless person to state that no relationship existed 
in his researches when, as a matter of fact, his researches had 
not proved the nonexistence of relationship. Computation of 
the coefficient of correlation can show that relationship (as herein 
defined 1 ) does exist, but it cannot show that it does not exist. A 
low coefficient of correlation shows merely that if a relationship 
does exist one has not yet found it. 2 

1 See p. 291. 

2 As Charles Kingsley says in his book “Water Babies,” “There are no 
such things as water-babies? How do you know that? Have you been 
there to see? And if you had been there to see, and had seen none, that 

15.3. Degrees of Freedom.—If we place but two points on 
a scatter diagram we can always find a straight line which will 
pass through both of them without any error. Therefore if we 
had studied the ages and blood pressures of but two people we 
could have found a straight-line formula for estimating their 
blood pressures from their ages without any error at all. But 
this would not mean that there was really any such close connec¬ 
tion between age and blood pressure. If we had computed such 
a formula, and then tried to use it in estimating the blood pressure 
of a third person, we would almost certainly have error. The 
fact that we can always find a straight line to pass through two 
points means that the solution of our normal equations for two 
observations is trivial. It does not really tell us anything at all 
about the underlying relationship. We have forced our line to 
agree with the points, and have lost two degrees of freedom. For 
this reason in correlation work we find it necessary to “correct” 
our results when we have used small numbers of cases, since with 
small samples there is a tendency for r to be greater than the 
actual r of the universe and for S_y to be smaller than the actual 
S_y of the universe. We overestimate the amount of relationship, 
and underestimate the amount of our error. If we let S'_y represent 
a corrected standard error of estimate, and r' a corrected 
coefficient of simple linear correlation, we can write 

\[ (S'_y)^2 = S_y^2 \left( \frac{n - 1}{n - 2} \right) \]

\[ (r')^2 = 1 - (1 - r^2) \left( \frac{n - 1}{n - 2} \right) \]

In our illustrative problem we have the following results: 

S y = 9.85 

r = +0.515 

n = 21 (the number of cases) 


would not prove that there were none. If Mr. Garth does not find a fox in 
Eversley Wood . . . that does not prove that there are no such things as 
foxes. . . . And no one has a right to say that no water-babies exist, till 
they have seen no water-babies existing; which is quite a different thing, 
mind, from not seeing water-babies. . . . ” 

We can paraphrase Mr. Kingsley: “And no one has a right to say that 
no relationship exists, till they have seen no relationship existing; which is 
quite a different thing, mind, from not seeing relationship.” 

If we substitute these values in the formulas for the corrected 
values of $S_y$ and r we get 

$$(S'_y)^2 = (9.85^2)\left(\tfrac{20}{19}\right) = 102.1$$
$$S'_y = 10.1$$

$$(r')^2 = 1 - (1 - 0.515^2)\left(\tfrac{20}{19}\right) = 0.227$$
$$r' = \pm 0.476$$

Since r was positive, we make r' positive, and call it +0.476. 
Thus r' is always given the same sign as r. If the value of $(r')^2$ 
computed by the formula turns out to be negative, the value of 
r' should be given as 0. 

It should be noted that allowance for degrees of freedom always 
increases the value of $S_y$ and reduces the value of r. In this case 
r was reduced from 0.515 to 0.476, which is a reduction of about 
7 or 8 per cent. We shall be better able to judge the significance 
of such a change when we have learned how to interpret values 
of r. 
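
The correction for degrees of freedom can be checked in a few lines of code. The following is only a sketch under stated assumptions: the function name `correct_for_df` is introduced here for illustration and does not come from the text. It applies the factor (n - 1)/(n - 2) to both measures and reports r' as 0 when the computed $(r')^2$ is negative.

```python
import math

def correct_for_df(s_y, r, n):
    """Return (corrected S_y, corrected r) using the factor (n - 1)/(n - 2).

    Hypothetical helper for illustration only; it requires n > 2.
    """
    factor = (n - 1) / (n - 2)
    s_corrected = math.sqrt(s_y ** 2 * factor)
    r_sq = 1 - (1 - r ** 2) * factor
    # A negative (r')^2 is reported as r' = 0; otherwise r' takes the sign of r.
    r_corrected = math.copysign(math.sqrt(r_sq), r) if r_sq > 0 else 0.0
    return s_corrected, r_corrected

# Blood-pressure example: S_y = 9.85, r = +0.515, n = 21
s_c, r_c = correct_for_df(9.85, 0.515, 21)
print(round(s_c, 1), round(r_c, 3))
```

With the blood-pressure figures the sketch gives a corrected r of about 0.476 and a corrected standard error of estimate of about 10.1.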

15.4. Standard Error of Correlation Coefficient.—In Chap. IX 
we learned that, although one can calculate the average of a 
group of figures, one can seldom calculate the average of the 
universe in which one is interested. We can, however, estimate 
the standard deviation of the means of samples drawn from the 
universe. This we called the standard error of the mean. Similarly 
we should like to compute the standard error of the coefficient 
of simple linear correlation; that is, we should like to 
estimate the standard deviation of the r’s of an infinite number 
of samples drawn from the same universe. In our sample we 
found that r = +0.515. Had we selected another sample of 
21 people, it is probable that we should have found a slightly 
different value of r. Had we selected 100 such samples, we 
should have found many values of r. Would these values have 
been widely scattered, or would they all have fallen close to the 
value +0.515 which we found? The standard error of r would 
be our estimate of the dispersion which would occur in the r’s of 
these many samples. 

Unfortunately, however, it has now been shown that if we 
select many samples from the same universe and compute the 
value of r for each sample, the values we get will not be normally 
distributed unless the size of the sample is very large and the 
degree of relationship (the size of r) moderate or small. With 
small samples or with large values of r, the distributions are 
badly skewed, and the shape of the distributions varies radically 
for different values of r. Such non-normal distributions cannot 
be accurately described by standard deviations, and consequently 
the standard error of r would give us only a very rough idea of the 
distribution unless the sample was large and the value of r small. 
Textbooks usually give the formula for finding this standard 
error as follows: 

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}}$$


If we were to apply the formula to our problem of blood pressures, 
substituting +0.515 for r and 21 for N, we should get 

$$\sigma_r = \frac{1 - 0.515^2}{\sqrt{21}} = 0.161$$

However, our present example is very poorly adapted for this 
formula because of its small size (N = 21). Therefore we should 
not rely on the standard error just computed. 
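
As a quick check on the arithmetic, the conventional formula can be written as a one-line function. This is a minimal sketch; the name `standard_error_of_r` is an assumption made for illustration. Direct evaluation gives about 0.16, in line with the figure above to two decimals.

```python
import math

def standard_error_of_r(r, n):
    """Conventional large-sample standard error of r: (1 - r^2) / sqrt(N)."""
    return (1 - r ** 2) / math.sqrt(n)

# Blood-pressure example: r = +0.515, N = 21
se = standard_error_of_r(0.515, 21)
print(round(se, 2))
```

The function makes no allowance for the skewness described above, so it carries the same limitation as the formula itself when the sample is small or r is large.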

15.5. The z-transformation.—In spite of the difficulties that 
stand in the way of using the standard error of r, we can fortunately 
transform any given r in such a way as to derive a coefficient 
which is normally distributed even in small samples and 
even when r is large. This method, worked out by R. A. Fisher,¹ 
seems involved at first glance, but experiment with an example 
or two shows that its application is in fact very simple. To apply 
the method we must first find the new function z as follows: 

$$2z = \log_e (1 + r) - \log_e (1 - r)$$

This formula would prove troublesome in practice to students 
who were not familiar with natural logarithms. It can be stated 
in terms of common logarithms thus: 

$$z = 1.1513 \log_{10}\left(\frac{1 + r}{1 - r}\right)$$

¹ For a more complete discussion of problems of small samples, the reader 
should consult R. A. Fisher, “Statistical Methods for Research Workers,” 
Oliver & Boyd, Edinburgh and London, 1932. 

To apply this to our case of blood pressures, where r = +0.515, 
we get the following: 

$$1 + r = 1.515$$

$$1 - r = 0.485$$

$$\frac{1 + r}{1 - r} = 3.125$$

$$\log_{10} 3.125 = 0.49485$$

$$(1.1513)(0.49485) = 0.570 = z$$

Fortunately, however, we do not even have to bother with this 
calculation, because the values of z for most common values of 
r have been computed and tabulated. By consulting the table in 
Appendix IV we can find immediately the value of z corresponding 
to almost any desired value of r, or the value of r for almost 
any value of z. In the present case the table shows us that when 
r = 0.515, z is almost exactly 0.57. 
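
The two logarithmic forms of the transformation can be compared numerically. In the sketch below the function names are assumptions introduced for illustration, and `math.atanh` is used as a third check because the inverse hyperbolic tangent is the same function of r. The constant 1.1513 is approximately half the natural logarithm of 10.

```python
import math

def z_natural(r):
    # z = (1/2) * [ln(1 + r) - ln(1 - r)]
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def z_common(r):
    # Same quantity via common logarithms; 1.1513 is approximately ln(10) / 2
    return 1.1513 * math.log10((1 + r) / (1 - r))

r = 0.515
print(round(z_natural(r), 3), round(z_common(r), 3), round(math.atanh(r), 3))
```

All three routes agree with the tabulated value of about 0.57 for r = 0.515.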

Fortunately this coefficient which we call z seems to be normally 
distributed (or approximately so) for samples drawn from 
almost every kind of universe. Hence the standard error of this 
measure will give us an accurate description of its distribution. 
The standard error of z is easily found from the simple formula 

$$\sigma_z = \frac{1}{\sqrt{n - 3}}$$

In our present problem since n = 21, we get 

$$\sigma_z = \frac{1}{\sqrt{18}} = 0.236$$

We can now interpret the significance of our correlation as follows: 
In our sample we found a z value of 0.57. We should 
almost never expect the value of z in any sample drawn at random 
from a given universe to differ from the actual z value of the 
universe by more than $3\sigma_z$. Therefore we feel confident that the 
z value of the universe is within $3\sigma_z = 0.708$ of the z value which 
we have found, 0.570. Hence the z value of the universe may 
lie anywhere between 0.570 − 0.708 and 0.570 + 0.708; or 
between −0.138 and +1.278. Referring again to Appendix IV, 
we find that these values of z correspond to correlation coefficients 
of below zero on the one hand and about 0.85 on the other. We 
conclude, therefore, that while our small sample showed a correlation 
coefficient of +0.515, in the universe from which the sample 
was drawn the relationship may have been anywhere from 
nothing at all up to 0.85. Our sample has been too small to 
tell us very much that we can depend on. We are not even sure 
that the relationship is positive. Of course, two-thirds of the 
z’s of other samples would be expected to fall within 0.236 of 
0.570; or between 0.334 and 0.806. We find from Appendix IV 
that these correspond to r values of +0.32 and +0.67. The 
chances are thus about 2 to 1 that the value of r in the universe 
lies between these values. 

It will be seen from our present example that the common 
method of computing r and its standard error (or its probable 
error) by the ordinary formula tends to exaggerate greatly the 
reliability of the coefficient of correlation. One should, rather, 
perform the seemingly more complicated process (it is actually no 
more lengthy when tables are used) of transforming r values to z 
values and testing the significance of the results by means of the 
methods we have just described. To summarize the method: 

1. Having found r and n from the sample, find the value of z corresponding 
to the computed value of r. (This is carried out directly from the table in 
Appendix IV.) 

2. Compute the standard error of z by the formula 

$$\sigma_z = \frac{1}{\sqrt{n - 3}}$$

3. Lay off the area within which the z of the universe will almost certainly 
be found. The limits of this area are arbitrary, but we have been using 
three standard errors in each solution. Therefore we add $3\sigma_z$ to the value 
of z found above and subtract $3\sigma_z$ from the value of z found above to get our 
two limiting z values. 

4. Find from the table in Appendix IV the r values corresponding to the 
two z values just discovered. These are the limiting r values, between which 
we feel reasonably certain that the r of the universe will be found. 
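
The four steps above can be sketched in code. This is an illustration under stated assumptions: the function name `r_limits` and its parameter `k` are introduced here, and `math.atanh` and `math.tanh` stand in for the lookups in the Appendix IV table, since they compute the same transformation and its inverse.

```python
import math

def r_limits(r, n, k=3.0):
    """Limits for the r of the universe via the z-transformation.

    k is the number of standard errors laid off on each side (3 in the text).
    """
    z = math.atanh(r)                              # step 1: r -> z
    se_z = 1.0 / math.sqrt(n - 3)                  # step 2: standard error of z
    z_low, z_high = z - k * se_z, z + k * se_z     # step 3: limiting z values
    return math.tanh(z_low), math.tanh(z_high)     # step 4: z -> r

low, high = r_limits(0.515, 21)
print(round(low, 2), round(high, 2))

# One standard error on each side gives the "two-thirds" limits
print([round(v, 2) for v in r_limits(0.515, 21, k=1.0)])
```

For the blood-pressure sample the three-standard-error limits run from slightly below zero to roughly 0.85, and the one-standard-error limits are close to +0.32 and +0.67.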

15.6. Actual Computations.—The methods we have developed 
for computing r and $S_y$ have been selected for presentation 
because they help to show the meaning of correlation. They are 
not the methods which would ordinarily be used in practice, 
because other methods require less arithmetical work. We shall, 
then, present other formulas which make the work of computation 
easier and which give the same results, and for purposes of 
illustration we shall work out a type problem by these simpler 
methods. 

First we give four formulas which are algebraically equal and 
from any one of which one can compute the value of r. The first 
is the formula with which we are already familiar. In these 
formulas the symbols used are the same as those we have used 
before except that the mean of the X’s is denoted by $M_x$ instead 
of by $\bar{X}$.¹ 


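(1) $$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

(2) $$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y}$$

(3) $$r = \frac{\Sigma(xy)}{\sqrt{(\Sigma x^2)(\Sigma y^2)}}$$

(4) $$r = \frac{\Sigma(XY) - N(M_x)(M_y)}{\sqrt{[\Sigma X^2 - N(M_x)^2][\Sigma Y^2 - N(M_y)^2]}}$$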


It will be noted that formulas (2) and (3) are stated in terms of 
deviations from the mean, as is shown by the small letters. If we 
were to compute r from formula (3), for example, we should compute 
the deviation of each X and of each Y from its average, 
multiply together these deviations, and sum the products. The 
result would be the numerator of the formula. We should then 
square the deviations and add them for the X’s and for the Y’s. 
We should multiply the two sums together and take the square 
root. The result would be the denominator. When the numerator 
was divided by the denominator, the result would be r. 
In this case the result may turn out to be either positive or negative. 
It is not necessary to give r the sign of b in the regression 
equation, because it will already have the correct sign as a result 
of the computations. In fact, this is true whether one uses formula 
(2), (3), or (4). It is only when we use formula (1) that we 
need to look to the sign of b to know the sign of r. 

¹ The value of r may also be computed from the two coefficients of regression 
described on p. 330. The relationship is this 

$$r = \sqrt{(b_{yx})(b_{xy})}$$

Obviously if the two regression coefficients are reciprocals of one another 
(as they would be if the two regression lines coincided), the value of r will 
be 1.00; that is, the regression line of Y on X and that of X on Y coincide 
when the correlation is perfect. 

The above relationship makes it evident that if we know the value of $b_{yx}$ 
and the value of r (as we usually do when we have carried out a correlation 
problem), the value of $b_{xy}$ can be found from the formula without long 
computation. 

Formula (2) is probably the best known of the formulas for r, 
yet in most cases it is not the easiest to use. Ordinarily formula 
(4) will give the result more quickly and with less mathematical 
drudgery than any other formula. When this formula is used, 
it should be remembered that $M_y$ means the mean of the original 
Y’s, not the mean of the deviations from the mean. The subscript 
is a small letter, but nevertheless it refers to the original 
Y’s. Of course, the mean of the y’s would be 0, as would the 
mean of the x’s. The means of the Y’s and of the X’s are the 
values referred to. 

Formula (4) gives simple, shorthand directions for the computation 
of r. When the directions are put into this form, they 
are much more concise than they could otherwise be. To make 
sure that they are not misunderstood (and at the same time to 
show what a saving there is in the statement as given in the 
formula), we can translate this mathematical statement of the 
directions into the following longer version, which directs what 
steps should be taken and in what order: 

1. Multiply each X by the corresponding Y (letting X be the independent 
and Y the dependent variable, as described on page 328). 

2. Add the products. 

3. Compute the average of the X’s. 

4. Compute the average of the Y’s. 

5. Count the number of cases. 

6. Multiply the answer to number 3 above by the answer to number 4, 
and multiply the product by the answer to number 5. 

7. Subtract the answer to number 6 above from the answer to number 2. 
This will give the numerator of the formula, and is equal to $\Sigma(xy)$ of formulas 
(2) and (3). 

8. Square each X and add the squares. 

9. Multiply the answer to number 5 above by the square of the answer 
to number 3. 

10. Subtract the answer to number 9 from the answer to number 8. 

11. Square each Y and add the squares. 

12. Multiply the answer to number 5 by the square of the answer to 
number 4. 

13. Subtract the answer to number 12 from the answer to number 11. 

14. Multiply together the answers to number 10 and number 13. 

15. Take the square root of the answer to number 14. The result is the 
denominator of the formula. 

16. Divide the answer to number 7 by the answer to number 15. The 
result is the value of r. 

Note what long and complicated directions these make when 
written out even in this brief form, and contrast them with the 
simple, short, but identical directions of formula (4). 
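
The sixteen steps translate almost line for line into code. The sketch below is illustrative only: the function names are assumptions introduced here, and the sample call uses the totals and rounded means that appear in the potato problem of Sec. 15.9 (N = 17, ΣXY = 3654.441, ΣX² = 215.7475, ΣY² = 69,730.15, means 3.52 and 62.55).

```python
import math

def r_from_sums(n, sum_xy, sum_x2, sum_y2, mean_x, mean_y):
    """Formula (4): r from sums of the original figures and the two means."""
    numerator = sum_xy - n * mean_x * mean_y      # steps 1-7
    x_part = sum_x2 - n * mean_x ** 2             # steps 8-10
    y_part = sum_y2 - n * mean_y ** 2             # steps 11-13
    denominator = math.sqrt(x_part * y_part)      # steps 14-15
    return numerator / denominator                # step 16

def r_from_data(xs, ys):
    """Same formula, starting from the paired observations themselves."""
    n = len(xs)
    return r_from_sums(
        n,
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
        sum(xs) / n,
        sum(ys) / n,
    )

# Totals and rounded means taken from the illustrative problem of Sec. 15.9
print(round(r_from_sums(17, 3654.441, 215.7475, 69730.15, 3.52, 62.55), 3))
```

Because the numerator and both halves of the denominator are small differences between large numbers, the result is sensitive to how far the means are rounded; `r_from_data` avoids this by carrying the means at full precision.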

15.7. Computation of Regression Equations.—Common formulas 
for the regression equation are as follows: 

( 1 ) 

( 2 ) 

(3) Y 

Again it will be noted that the first two of these formulas require 
that the X’s and the Y’s be reduced to deviations from their 
means, while formula (3) is in terms of the original figures. We 
have already learned that the regression equation can be found 
by solving the two normal equations 

(4) $$Na + b\Sigma X = \Sigma Y$$
$$a\Sigma X + b\Sigma X^2 = \Sigma XY$$

This fourth method gives results which are identical with those 
obtained from the other three methods, and it is often the most 
convenient and easy to use. Usually one will find that either 
formula (3) or the normal equations of method (4) will require 
the least time and work. 

15.8. Computing the Standard Error of Estimate.—On page 
444 we computed the standard error of estimate by solving the 
regression equation for each value of X, finding the difference 
between the estimated and the actual values of Y, and computing 
the standard deviation of these differences. This procedure 
would require the expenditure of a very large amount of time, and 
in practice $S_y$ is never so computed. It is important that the 
student understand that $S_y$ is the standard deviation of the errors 
of estimate; it is for that reason that we so computed it before. 
But when one once understands it, there is no longer any reason 
for taking the long method of discovering the value of $S_y$. As a 
matter of fact, this value is almost always discovered from the 
relationship 

$$S_y = \sigma_y\sqrt{1 - r^2}$$

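One can see easily whence this formula comes. We know (see 
page 447) that 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

$$r^2 = \frac{\sigma_y^2 - S_y^2}{\sigma_y^2}$$

$$r^2\sigma_y^2 = \sigma_y^2 - S_y^2$$

$$S_y^2 = \sigma_y^2(1 - r^2)$$

$$S_y = \sigma_y\sqrt{1 - r^2}$$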


This is our formula for the standard error of estimate given above. 
In working out the value of r, we either compute the value of $\sigma_y$ 
or almost do so, depending on the formula used. If we work with 
formula (1) on page 455 we have it all computed; this is also 
true if we use formula (2). In formula (3) we computed the 
value of $\Sigma y^2$, and $\sigma_y = \sqrt{\Sigma y^2 / N}$. Hence we can easily compute 
$\sigma_y$. If we use formula (4), we compute the value of 
$(\Sigma Y^2 - N M_y^2)$. This is equal to $\Sigma y^2$, and therefore the value of 
$\sigma_y$ may be computed just as when formula (3) is used. One 
merely divides by N and takes the square root. 

15.9. Illustrative Problem.—We now apply these methods to 
an actual problem to see how they work in practice. Suppose 
that we wish to determine whether or not relationship exists 
between the production of potatoes and their price. We shall 
first work out the correlation problem with the original figures, 
uncorrected for changes in the general price level or for changes 
in population. Table 15.2 gives figures¹ showing the average 
farm price per bushel on Dec. 1 of each year and the total United 
States production for the year in millions of bushels. The data 
are for the years 1896-1912. (The war years are purposely 
omitted here on account of the violent fluctuation in the value 
of money.) The average production will be equal to 4,991/17, 
or 293.6, million bushels. The average price is 918.6/17 = 54.0 
cents. Let us compute r by formula (4) on page 455, 

$$r = \frac{269{,}053.7 - (17)(293.6)(54.0)}{\sqrt{[1{,}530{,}629 - (17)(293.6^2)][52{,}487.16 - (17)(54^2)]}} = \frac{-471.1}{14{,}145} = -0.033$$

¹ From “U.S. Department of Agriculture Yearbook,” 1922, p. 668. 

This would show very little relationship, since the lowest possible 
figure would be 0, and here we have a coefficient of but 0.033. 


Table 15.2.—Computation of Coefficient of Simple Linear 
Correlation between Potato Production and Potato Prices, 
1896-1912 

Year      Production   Price    (XY)         (X²)         (Y²)
          (X)          (Y)
(1)       (2)          (3)      (4)          (5)          (6)

1896        272        29.0      7,888.0       73,984       841.00
1897        191        54.2     10,352.2       36,481     2,937.64
1898        219        41.5      9,088.5       47,961     1,722.25
1899        260        39.7     10,322.0       67,600     1,576.09
1900        248        42.3     10,490.4       61,504     1,789.29
1901        199        76.3     15,183.7       39,601     5,821.69
1902        294        46.9     13,788.6       86,436     2,199.61
1903        262        60.9     15,955.8       68,644     3,708.81
1904        352        44.8     15,769.6      123,904     2,007.04
1905        279        61.1     17,046.9       77,841     3,733.21
1906        332        50.6     16,799.2      110,224     2,560.36
1907        323        61.3     19,799.9      104,329     3,757.69
1908        302        69.7     21,049.4       91,204     4,858.09
1909        395        54.2     21,409.0      156,025     2,937.64
1910        349        55.7     19,439.3      121,801     3,102.49
1911        293        79.9     23,410.7       85,849     6,384.01
1912        421        50.5     21,260.5      177,241     2,550.25

Totals    4,991       918.6    269,053.7    1,530,629    52,487.16

Let us correct it for the number of cases as suggested on page 450. 
This will give 

$$r'^2 = 1 - (1 - 0.033^2)\left(\tfrac{16}{15}\right)$$

This equation gives a negative value for $r'^2$, and, as we noted on 
page 451, it means that we must call the value of the coefficient of 
correlation 0. Our conclusion is, then, that r' = 0. There is no 
evidence of linear relationship between the variables. 

Yet it seems peculiar, does it not, that there should be no relationship 
between the production of potatoes and their price? 
(To be sure, we have not shown that there is no relationship, but 
we have failed to find any.) How can we explain it? Possibly 
by the fact that the higher productions of the later years did not 
bring lower prices because population had increased in the meantime, 
and perhaps in part by the fact that the general price level 
was rising throughout the period, so that a constant price for 
potatoes would have corresponded to a falling purchasing power. 

Table 15.3.—Estimated United States Population and Index 
Numbers of Wholesale Prices of All Commodities, 1896-1912 

Year      Estimated Population    Index Number
          (millions)              of Prices

1896           70.9                    68
1897           72.2                    68
1898           73.5                    71
1899           74.8                    77
1900           76.1                    82
1901           77.7                    81
1902           79.4                    86
1903           81.0                    87
1904           82.6                    87
1905           84.2                    88
1906           85.8                    90
1907           87.5                    95
1908           89.1                    92
1909           90.7                    99
1910           92.3                   103
1911           93.7                    95
1912           95.1                   101

There is certainly no reason for proceeding further by linear 
methods with a problem in which there is such a small amount of 
linear correlation, but it may be worth while to correct our 
original data to see if there would then be correlation. 

It would seem logical to correct the figures on potato production 
by dividing them by the population of the United States, 
thus putting them in terms of production per capita. And it 
would also seem wise to correct the price figures for changes in the 
value of money by dividing them by the price index number (as 
we did on page 431). In Table 15.3 are estimates of the population 
of the United States for each of the years in question¹ and 
price index numbers for the same years based on the average 
wholesale prices of all commodities in the period 1910-1914.² 

In Table 15.4 the column headed X contains figures on per-capita 
potato production, found by dividing the actual production 
by the population. The figures are in bushels per capita. 

Table 15.4.—Computation of Coefficient of Simple Linear 
Correlation between Corrected Potato Prices and Production 
per Capita, 1896-1912 

Year       (X)       (Y)       (XY)         (X²)        (Y²)

1896       3.84      42.6      163.584      14.7456     1,814.76
1897       2.64      79.7      210.408       6.9696     6,352.09
1898       2.98      58.5      174.330       8.8804     3,422.25
1899       3.46      51.5      178.190      11.9716     2,652.25
1900       3.26      51.6      168.216      10.6276     2,662.56
1901       2.56      94.3      241.408       6.5536     8,892.49
1902       3.71      54.5      202.195      13.7641     2,970.25
1903       3.24      70.1      227.124      10.4976     4,914.01
1904       4.26      51.5      219.390      18.1476     2,652.25
1905       3.32      69.5      230.740      11.0224     4,830.25
1906       3.88      56.2      218.056      15.0544     3,158.44
1907       3.70      64.5      238.650      13.6900     4,160.25
1908       3.39      75.8      256.962      11.4921     5,745.64
1909       4.35      54.8      238.380      18.9225     3,003.04
1910       3.76      54.1      203.416      14.1376     2,926.81
1911       3.12      84.1      262.392       9.7344     7,072.81
1912       4.42      50.0      221.000      19.5364     2,500.00

Totals    59.89   1,063.3    3,654.441     215.7475    69,730.15


In the column headed Y the figures represent the price of potatoes 
in 1910-1914 dollars; that is, an attempt has been made to correct 
the prices to account for changes in the value of money. The 
average per-capita production for the period was 59.89/17 = 3.52 
bushels; and the average price (stated in terms of 1910-1914 
dollars) was 1063.3/17 = 62.55 cents. If we substitute these 
figures in formula (4) on page 455, we get 

$$r = \frac{3654.441 - (17)(3.52)(62.55)}{\sqrt{[215.7475 - (17)(3.52^2)][69{,}730.15 - (17)(62.55^2)]}} = \frac{-88.551}{128.23} = -0.690$$

¹ From Statistical Abstract, 1933, p. 10. 

² From Farm Economics, Cornell University, September, 1931, pp. 1586-1587. 


Correcting for the number of cases, we get 

$$r'^2 = 1 - (1 - 0.690^2)\left(\tfrac{16}{15}\right) = 0.441$$
$$r' = \sqrt{0.441} = \pm 0.664$$

We give this corrected value the sign of the original uncorrected 
r, which was negative. Thus we state that r' = −0.664. 

Again the value of r is low. We shall see a little later how it is 
to be interpreted. In the meantime we shall compute the regression 
equation and the value of $S_y$. In computing the latter, we 
shall use the corrected value of r. Using the formula on page 458, 
we get 

$$S_y = \sigma_y\sqrt{1 - r^2}$$

We compute the value of $\sigma_y$ from the equation 

$$\sigma_y = \sqrt{\frac{\Sigma Y^2 - N M_y^2}{N}} = \sqrt{\frac{3217.61}{17}} = 13.8$$

The value of $\sqrt{1 - r^2}$ can be computed from the value of r, but 
it is much easier to look it up in books of tables.¹ Then we get 
the following value for $S_y$: 

$$S_y = 13.8(0.7477) = 10.3$$
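
The same computation can be sketched in code. The function names below are assumptions introduced for illustration; the inputs are the totals from the corrected potato data (ΣY² = 69,730.15, N = 17, mean 62.55) and the corrected r of 0.664. The last line uses the identity sin(arccos r) = √(1 − r²), which is what makes a trigonometric table usable for this purpose.

```python
import math

def sigma_from_sums(sum_sq, n, mean):
    """Standard deviation from the sum of squares of the original values."""
    return math.sqrt((sum_sq - n * mean ** 2) / n)

def standard_error_of_estimate(sigma_y, r):
    """S_y = sigma_y * sqrt(1 - r^2)."""
    return sigma_y * math.sqrt(1 - r ** 2)

sigma_y = sigma_from_sums(69730.15, 17, 62.55)
s_y = standard_error_of_estimate(sigma_y, 0.664)
print(round(sigma_y, 1), round(s_y, 1))

# sin(arccos r) equals sqrt(1 - r^2)
print(round(math.sin(math.acos(0.664)), 4))
```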


The standard deviation of the corrected prices themselves was 
13.8 cents, so without any regression equation at all we could have 
estimated the corrected price within 13.8 cents of the correct 
figure in two-thirds of the cases. Using the regression equation, 
we can estimate within 10.3 cents of the true figure in two-thirds 
of the cases. There has been some improvement by using the 
regression equation; the reduction from 13.8 to 10.3 cents is a 
reduction of 25 per cent. This is certainly better than nothing, 
yet a comparison of $S_y$ and $\sigma_y$ indicates that after using the 
regression equation we still have 75 per cent of the original error 
with which we started. 

¹ If special statistical tables are not available, the desired value may be 
read from ordinary trigonometric tables. As Holbrook Working has pointed 
out, if we let $r = \cos \alpha$, then $\sin \alpha = \sqrt{1 - r^2}$. Using the present case as an 
example, r = 0.664. (We neglect the sign of r in this work.) Find the 
angle whose cos = 0.664. This turns out to be 48° 24'. The sine of this 
angle is 0.7477, which is the value of $\sqrt{1 - r^2}$ when r = 0.664. 

We must still compute the regression equation. We shall 
substitute the proper values in the normal equations and solve for 
the values of a and b. This computation gives us the following: 

$$17a + 59.89b = 1063.3$$
$$59.89a + 215.7475b = 3654.441$$

Solving, we find 

$$a = 130.36$$
$$b = -19.25$$

Our regression equation, then, is 

$$Y = 130.36 - 19.25X$$
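
The pair of normal equations can be solved by elimination in a few lines. The sketch below is illustrative; the function name `solve_normal_equations` is an assumption introduced here. Carried at full precision, the solution agrees with the values above to within about a tenth of a unit, the small differences presumably arising from rounding in the hand computation.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve  n*a + b*sum_x = sum_y  and  a*sum_x + b*sum_x2 = sum_xy."""
    determinant = n * sum_x2 - sum_x ** 2
    if determinant == 0:
        raise ValueError("all X values are identical; no unique line exists")
    b = (n * sum_xy - sum_x * sum_y) / determinant
    a = (sum_y - b * sum_x) / n
    return a, b

# Totals from the corrected potato data
a, b = solve_normal_equations(17, 59.89, 1063.3, 215.7475, 3654.441)
print(round(a, 1), round(b, 1))
```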

Suppose that the present population is 100 million, we have a 
potato crop of 350 million bushels, and the current price index is 
150. What is the expected price of potatoes? 

A crop of 350 million bushels when the population is 100 million 
people is a crop of 3.5 bushels per capita. The X in our regression 
equation represents the crop in bushels per capita. We therefore 
substitute 3.5 for X in the regression equation and solve: 

Y = 130.6 - 19.25(3.5) = 63.2 

This tells us that we should expect a corrected price of 63.2 cents. 
But we should like to know what actual price to expect. The 
corrected price is the actual divided by the index number, and 
the index number is now 150. If we multiply the corrected price 
by the index number, it will be put back into the form of ordinary 
prices. This gives us 

63.2(1.5) = 94.8 cents 

We can say, then, that our best estimate of the actual price is 
94.8 cents. 

How much error can we expect in the estimate? $S_y$ tells us 
that in two trials out of three we should be able to estimate within 
10.3 cents of the actual corrected figure. But these are 10.3 cents 
in corrected prices. With the index number at 150, this corresponds 
to 10.3(1.5) = 15.4 cents of the current dollar. Hence 
we can say that the chances are 2 out of 3 that our estimate of 
94.8 cents will be within 15.4 cents of the correct figure, and we are 
practically certain that it will not miss by over 3(15.4) = 46.2. 
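
The whole forecasting procedure (per-capita conversion, regression estimate, and conversion back to current prices) can be collected into one function. This is a sketch under stated assumptions: the name `forecast_price` and its default arguments are introduced here, the defaults being the constants a = 130.36, b = −19.25, and $S_y$ = 10.3 found above. Its result lies within a few tenths of a cent of the hand-worked figure.

```python
def forecast_price(crop_millions, population_millions, index_number,
                   a=130.36, b=-19.25, s_y=10.3):
    """Return (expected price, two-in-three margin), both in current cents."""
    x = crop_millions / population_millions    # bushels per capita
    corrected = a + b * x                      # price in 1910-1914 cents
    scale = index_number / 100.0               # convert back to current prices
    return corrected * scale, s_y * scale

price, margin = forecast_price(350, 100, 150)
print(round(price, 1), round(margin, 2))
```

Multiplying the margin by 3 gives the wider limits within which the estimate is practically certain to fall.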

It will be noted that preliminary adjustments of our data 
(correcting for population change and changes in the price level) 
raised our correlation in this case from none at all to 0.664 (corrected 
for number of cases). It is possible that further adjustments 
(such as elimination of trends, etc.) would raise the coefficient 
still higher. The student should understand that it is commonly 
necessary for a statistician in any field to adjust his data 
before he begins to correlate. It is of great importance that 
these adjustments be made by someone who is familiar with the 
data and with the field of knowledge being studied. A thorough 
acquaintance with statistical methods is a great help to workers 
in many fields, but it cannot fit men to work in any field. The 
statistician must know more than statistical method before it is 
safe to turn him loose. He must know his genetics or his economics 
or his education or his biology first in order that he may 
know how to apply his statistical methods within the field. 

15.10. Interpretation of Correlation Coefficients.—We have 
now computed the coefficient of simple linear correlation in three 
cases. When we were studying the relationship between age and 
blood pressure, we found that r' = +0.476. When we studied 
the relationship between potato production and the farm price 
of potatoes, we found that r' = 0. Now we have studied the 
relationship of per-capita output of potatoes to their corrected 
price and find that r' = —0.664. What do these figures mean? 

To begin with, we know already the meaning of the sign of the 
coefficient. We have discovered that r has the same sign as b 
in the regression equation, and that, if the sign is plus, the correlation 
is positive, while, if the sign is minus, the correlation is 
negative (the adjectives being used in the sense explained on 
page 294). That is all we need to know about the sign of the 
coefficient, and the remainder of our attention will be given to 
the absolute size of r. A coefficient of r = −0.664 and a coefficient 
of r = +0.664 show exactly the same degree of relationship. 
The signs merely show whether the regression line slopes up 
toward the right (positive correlation) or down toward the right 
(negative correlation). 

We know from our past discussion (see pages 448 ff.) that a 
coefficient of 1.00 denotes perfect correlation, with all the points 
falling on the regression line. It is possible for us to estimate the 
values of Y from the values of X without error. We know that a 
coefficient of 0 denotes the absence of linear correlation so that 
we can make no better estimates with the regression equation 
than without it. Coefficients usually fall somewhere between 0 
and 1 in size. Then the problem of interpretation is somewhat 
more difficult. We can, however, say two or three things definitely 
about the nature of the relationship if we know the absolute 
value of r. Let us suppose, then, that we have a case in which 
r = 0.800 (it is immaterial whether this is +0.800 or −0.800). 
Let us discover as many things about the relationship as we can. 

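In the first place, we know that 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

In this case, then, we know that 

$$0.800^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

$$0.64 = 1 - \frac{S_y^2}{\sigma_y^2}$$

$$\frac{S_y^2}{\sigma_y^2} = 0.36$$

$$\frac{S_y}{\sigma_y} = \sqrt{0.36} = 0.600$$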

In other words, we know that the ratio of $S_y$ to $\sigma_y$ is 0.600, or 
that we have but 60 per cent as much error in estimate by using 
the regression line as by trying to get along without it. From 
any value of r we can find, then, the ratio of $S_y$ to $\sigma_y$, and from this 
ratio we can tell something about the advantage of using the 
regression line. Since $S_y = \sigma_y\sqrt{1 - r^2}$, it is obvious that 

$$\frac{S_y}{\sigma_y} = \sqrt{1 - r^2}$$

To know the ratio it is necessary merely to compute this value.¹ 

¹ The value $\sqrt{1 - r^2}$ is called the coefficient of alienation, since it measures 
the extent of departure from perfect correlation. 

But, as we have seen, one can look up the value in the trigonometric 
tables and save the time of computing it.¹ If we use the 
trigonometric tables to look up the values of the ratio $S_y/\sigma_y$ 
for the various values of r which we have computed, we find 
when 

$$r = 0$$
$$\frac{S_y}{\sigma_y} = 1.00$$

In this case we still retained 100 per cent of the error that we 
should have had by plain guesswork. When 

$$r = -0.664$$
$$\frac{S_y}{\sigma_y} = 0.7477$$

In this case we still had 75 per cent of the error which we would 
have had in guesswork; that is, had we guessed at the corrected 
price instead of estimating it from the per-capita production, 
our error would have been only a little greater. In the case of 
the blood pressures, when 

$$r = 0.476$$
$$\frac{S_y}{\sigma_y} = 0.879$$

In this case the use of the regression equation cuts the error of 
estimate to 87.9 per cent of what it would be without the regression 
equation. One way, then, of interpreting r is to compute 
the amount of the ratio $S_y/\sigma_y$ in order to find out whether the 
error of estimate when we use the regression equation is relatively 
large or small as compared with that which we get when we do not 
use the regression equation. 
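
The ratio is easy to tabulate directly instead of reading it from trigonometric tables. In the sketch below the function name `error_ratio` is an assumption introduced for illustration; the values of r are the ones discussed above.

```python
import math

def error_ratio(r):
    """S_y / sigma_y = sqrt(1 - r^2): share of the original error that remains."""
    return math.sqrt(1 - r ** 2)

for r in (0.0, 0.476, 0.664, 0.800):
    print(f"r = {r:.3f}  ratio = {error_ratio(r):.3f}")
```

Because the sign of r does not affect the ratio, the absolute value of each coefficient is all that needs to be supplied.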

If one does not have trigonometric tables handy, or wishes to 
visualize w r hat we have been discussing a little better, he can 
easily graph the relationship between S y /a y and r . Lay off 
vertical and horizontal axes on graph paper. On the vertical 
axis mark off percentages from 0 to 100 per cent. On the hori¬ 
zontal axis lay off decimals from 0 to 1. Make the scales the 
same size. Then with a compass describe a quarter circle passing 
through the 100 per cent point of the vertical scale and the point 
marked 1.00 on the horizontal scale. The horizontal scale 
represents values of r, and the vertical scale represents values of 
$S_y/\sigma_y$. Figure 15.2 shows the general idea, but the student can 
easily make a larger scale with the finer subdivisions showing on 
the graph paper, and from it he can read with sufficient accuracy 
for practical purposes the values of $S_y/\sigma_y$. 1 The chart will also 
serve to show something about the relative importance of coefficients 
of correlation of various sizes. 

Fig. 15.2. Percentage which the standard error of estimate is of the 
standard deviation when the coefficient of correlation has various values. 

1 See p. 462n. 

1 It will be recalled that the formula for a circle when the center is at the 
point of origin is 

$$x^2 + y^2 = r^2$$

when r is the radius of the circle. We know in our case that 

$$\left(\frac{S_y}{\sigma_y}\right)^2 + r^2 = 1$$

Hence we know that the relationship can be shown by a quarter circle with 
unity (100 per cent, or r = 1.00) as the radius. Hence the construction 
of our graph. It is easier to construct a graph than to solve the equation if 
one has coordinate paper and a compass. 

It will be seen that if we start with a coefficient of correlation of 
zero and continually increase the size of the coefficient, the ratio 
$S_y/\sigma_y$ will become smaller; that is, the error around the regression 
line will become smaller as compared with the error around the 
mean. But at first the decreases in the ratio are very small. 
An increase in the size of r from 0 to 0.1 brings almost no advantage. 
An increase from 0.1 to 0.2 brings somewhat more 
accuracy of estimate. If we increase the value of r to 0.5, we 
have eliminated 13.4 per cent of the error of guesswork, and 
$S_y/\sigma_y$ is 86.6 per cent. In order to get to the point where the 
error around the regression line is but half the error around the 
mean, we have to continue to increase the size of r until it reaches 
0.866. This means that we get as much additional accuracy by 
increasing r from 0.866 to 1.00 as we got by raising it all the way 
from 0.00 to 0.866. The line on the chart falls very rapidly as we 
approach the value r = 1.00. This shows that with large coeffi¬ 
cients almost any increase in size will bring a decided reduction in 
the error of estimate. Many students fail to realize this point, 
and they get the idea that an increase of r from 0.1 to 0.2 is 
the same as an increase from 0.8 to 0.9. Inspection of the dia¬ 
gram shows how incorrect this notion is. 
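
The unequal worth of equal steps in r can be checked numerically. This is only a sketch under the relation $S_y/\sigma_y = \sqrt{1 - r^2}$; the helper names `gain` and `r_needed` are hypothetical.

```python
import math


def error_ratio(r: float) -> float:
    """S_y / sigma_y = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def gain(r_low: float, r_high: float) -> float:
    """Drop in S_y / sigma_y obtained by raising r from r_low to r_high."""
    return error_ratio(r_low) - error_ratio(r_high)


def r_needed(ratio: float) -> float:
    """Size of r required to bring S_y / sigma_y down to `ratio`."""
    return math.sqrt(1.0 - ratio * ratio)


print(f"0.1 -> 0.2 removes {100 * gain(0.1, 0.2):.1f} percentage points of error")
print(f"0.8 -> 0.9 removes {100 * gain(0.8, 0.9):.1f} percentage points of error")
print(f"r needed to halve the error: {r_needed(0.5):.3f}")
```

The step from 0.8 to 0.9 removes more than ten times as much error as the step from 0.1 to 0.2.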

On page 451 we found a corrected r of 0.476 when the uncor¬ 
rected value was 0.515. At that time we raised the question 
as to whether such a negligible correction was worth while. 
Later, on page 459, we computed a coefficient of 0.033 which 
corrected to 0. By changing from 0.033 to 0, we lost but 1/20 
of 1 per cent in accuracy of estimate. In changing from 0.515 
to 0.476 we changed from a purported improvement in accuracy 
of 14.3 per cent to a corrected improvement of 12.1 per cent: 
a change of 2.2 per cent. Thus it is seen that our correction in 
the latter case was more significant than the correction which 
reduced a lower coefficient to zero. 

Let us state this in another way. By adjusting our data on 
prices and production in the potato problem, we raised the value 
of r' from nothing to 0.664. In doing this we cut the error of 
estimate to 75 per cent of its former amount (actually 74.77 per 
cent). This cut the error 25 per cent. To cut the error another 
25 per cent, so that it would be 50 per cent of the error around the 
mean, we should have to raise the value of r' to 0.866. Thus we 
see that raising r' from 0.664 to 0.866 has the same effect as raising 
it from 0.00 to 0.664. The student should remember that small 
values of r have little importance, merely because they mean that 
there is little relationship between the data. But when one gets 
larger values of r, any adjustment of the data which will bring 
even a small increase in the size of r is worth while, because it will 
bring a large increase in the accuracy of estimates. 

This method of interpreting the meaning of r is the most useful, 
but there are one or two other methods worth consideration. 
We recall that $\sigma^2$ is called “variance.” It is a measure of dispersion. 
Now it is plain from the first formula on page 455 that 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

In other words, the square of the coefficient of correlation is equal 
to unity minus the ratio of the variance of the errors in estimate 
around the line to the variance of the original figures. Since the 
largest possible value of $S_y^2/\sigma_y^2$ is 1.00, and since we are here subtracting 
the ratio from 1.00, we are showing the percentage by 
which the error of estimate has been reduced by the use of the 
estimating equation if we use variance as our measure of error. 
For example, suppose that we correlate students’ high-school 
records with the marks made in their freshman year in college and 
find that r = +0.600. In this case $r^2 = 0.36$, and we can say 
that we have eliminated 36 per cent of the variance by using the 
regression equation to make our estimates. More commonly 
statisticians say, “36 per cent of the variation in college marks 
can be explained in terms of high-school records.” When a state¬ 
ment of this kind is made, one can be reasonably certain that 
it is based on $r^2$. The value of $r^2$ is called the coefficient of determination. 
Ezekiel says of this measure: 1 

Where both X and Y are assumed to be built up of simple elements of 
equal variability all of which are present in Y but some of which are 
lacking in X, it can be proved mathematically that $r^2$ measures that 
proportion of all the elements in Y which are also present in X. . . . 
It [$r^2$] may be said to measure the per cent to which the variance in Y is 
determined by X, since it measures that proportion of all the elements 
of variance in Y which are also present in X. . . . Since this is the most 
direct and unequivocal way of stating the proportion of the variance in 
the dependent factor which is associated with the independent factor 
it may be used in preference to the other methods. 

1 M. J. B. Ezekiel, “Methods of Correlation Analysis,” p. 120, John 
Wiley & Sons, Inc., New York, 1930. 

In our example of per-capita production and corrected price, 
we discovered that r = -0.664. We might, then, interpret 
it by computing the coefficient of determination, which is 
$(-0.664)^2 = 0.441$. This would tell us that 44.1 per cent of the 
variation in corrected price can be explained in terms of per-capita 
production, the other 55.9 per cent being tied up with other 
factors, such as changes in taste and habit on the part of con¬ 
sumers, changes in incomes, and so on. The coefficient of deter¬ 
mination tells us the percentage of the variation in the dependent 
variable that can be explained in terms of the independent 
variable. 
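
As a small illustration, the coefficient of determination for the potato example can be computed as follows. The function name is an assumption made for this sketch.

```python
def coefficient_of_determination(r: float) -> float:
    """Share of the variance in Y associated with X, that is, r squared."""
    return r * r


r = -0.664
explained = coefficient_of_determination(r)
print(f"explained: {explained:.3f}, unexplained: {1 - explained:.3f}")

# Caution: Python reads the expression -0.664 ** 2 as -(0.664 ** 2), which is
# negative. Square the signed value with parentheses or with r * r instead.
```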

A third method of interpreting the meaning of r can be deduced 
from the first formula for the regression equation on page 457. 
Here we find that 

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

From this equation it is obvious that 

$$\frac{y}{\sigma_y} = r\,\frac{x}{\sigma_x}$$

Therefore 

$$r = \frac{y\,\sigma_x}{x\,\sigma_y}$$

But y is the deviation of any value of the dependent variable from 
its average, and in the regression equation it refers to our computed 
value of Y. And x is the deviation of any value of X from 
the mean of the X’s. Both x and y are divided by the standard 
deviation; this means that we are measuring the deviations 
in units of the standard deviation. Suppose that we have 
perfect correlation; that is, r = ±1.00. Then it is evident that 
$y/\sigma_y = x/\sigma_x$. In other words, we shall estimate a value of Y 
which is as many standard deviations from its mean as is the 
value of X in which we are interested. We want to know the 
most probable value of Y to accompany a given value of X. 
We find how many standard deviations X is from its mean and 
place Y exactly the same number of standard deviations from its 
mean. If, however, r = 0.500, we note that $y/\sigma_y = 0.5\,x/\sigma_x$. 
In this case if we are trying to estimate the value of Y which is 
most likely to accompany a given value of X, we shall first 
determine by how many standard deviations the X differs from 
its average and then estimate that Y will be half as many stand¬ 
ard deviations from its average. If r = 0.75, we shall estimate 
a value of Y that differs from the average of the Y’s by three-
quarters as many standard deviations as the number by which X 
differs from the average of the X’s. In other words, we can look 
upon r as a percentage showing how far Y will be from the average 
in terms of the deviation of X. We always estimate a value 
of the dependent variable that is r times as far from its average 
(in units of the standard deviation) as is the independent variable 
from its average. 
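
The rule just stated can be sketched in a few lines. The means and standard deviations below are hypothetical figures chosen only to make the arithmetic easy to follow; they do not come from the text.

```python
def estimate_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Estimate Y for a given X by placing Y r times as many standard
    deviations from its mean as X lies from its own mean."""
    x_in_sd_units = (x - mean_x) / sd_x
    y_in_sd_units = r * x_in_sd_units
    return mean_y + y_in_sd_units * sd_y


# Hypothetical figures (not from the text): X has mean 50 and standard
# deviation 10; Y has mean 100 and standard deviation 20.
for r in (1.0, 0.75, 0.5):
    print(r, estimate_y(70, 50, 10, 100, 20, r))
```

Here X = 70 lies two standard deviations above its mean, so the estimates of Y lie 2, 1.5, and 1 standard deviations above the mean of Y.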

In cases of perfect correlation r has a value of 1, and in all other 
cases the value of r is less than 1. If, then, we estimate that Y 
is r times as far from its average as X is from its average (measur¬ 
ing both in standard units), it must be that we are always esti¬ 
mating Y closer to its average than the value of X except in those 
unusual cases of perfect correlation. We look first to see how far 
the independent variable is from its average, and we then estimate 
a value of the dependent variable closer to the average. The 
dependent variable tends to “regress” toward the mean. The 
great English statistician Francis Galton was much interested 
in studies of inheritance. One of his pioneering studies in the 
field of correlation was a study of the relationship between the 
heights of fathers and the heights of their sons. He found that 
tall fathers tended on the average to have tall sons, but that on 
the average the heights of the sons were closer to the average 
than were the heights of their fathers. Hence he came to speak 
about the tendency for sons’ heights to “regress” toward the 
average, and the line which pictured the relationship he called 
the regression line. This is the reason that we speak of “regres¬ 
sion,” and of “regression equations” and “regression lines” 
today. 

Each of the methods of interpreting r which we have so far 
discussed depends on standard deviations. Usually we are 
interested in the standard deviation of the dependent variable 
and the standard deviation of the scatter around the regression 
line. These are represented by $\sigma_y$ and $S_y$. We learned, however, 
in Chap. VI that the standard deviation has an exact and known 
interpretation only when computed from a normal distribution. 
This means that we cannot use the methods of interpretation 
given here with any degree of assurance unless the dependent 
variable and the scatter around the regression line (the dis¬ 
tribution of the residuals) are both normal distributions. In 
correlation work it is desirable to test these distributions for 
normality. These observations on the importance of normality 
also hold good in the more complicated problems of multiple 
and joint correlation which are to be covered in later chapters. 

Finally, if we are to interpret r satisfactorily we must know its 
standard error (or its probable error). We have seen that values 
of r below 0.5 or 0.7 do not indicate the existence of much rela¬ 
tionship. Even where r has a value considerably above these 
points, we must attempt to discover whether or not the value 
could reasonably be expected to have arisen by chance from basic 
data that were actually uncorrelated. In our study of the 
relationship between per-capita production of potatoes and their 
corrected price, we found that r = -0.664 when n = 17. But 
is this correlation significantly different from zero; that is, is 
it high enough to make us feel certain that there actually was 
some degree of negative relationship in the universe? To test 
this we use as a standard error the measure 1 

$$\sigma_r = \frac{1}{\sqrt{n - 1}}$$

1 This formula is used only when testing to see if a given coefficient is 
significantly different from zero. It is evidently found by letting r = 0 in 
the formula on p. 452. 

In our present example this formula gives us 

$$\sigma_r = \frac{1}{\sqrt{16}} = 0.25$$

Dividing our value of r by its standard error, we get 

$$\frac{r}{\sigma_r} = \frac{-0.664}{0.25} = -2.656$$

In other words, this coefficient differs from 0 by but 2.656 standard 
deviations. If there were no correlation at all in the universe, 
and if we drew many samples of 17 cases each, we should expect 
in half of the samples to find positive correlation. In another 
49.6 per cent of the cases we should expect to find coefficients 
of correlation between 0 and $-2.66\sigma$. But in four-tenths of 1 
per cent of the cases we should get just by pure chance negative 
values of r of greater absolute magnitude than this one. While 
it is rather unlikely, then, such a coefficient may possibly have 
arisen by chance, and we cannot call our result significant with 
entire confidence without studying more cases. 
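
The test above can be reproduced as a rough sketch. It assumes, as the passage does, that the ratio $r/\sigma_r$ may be referred to the normal curve; the function names are illustrative only.

```python
import math


def standard_error_of_r_if_uncorrelated(n: int) -> float:
    """sigma_r = 1 / sqrt(n - 1), used only for testing a difference from zero."""
    return 1.0 / math.sqrt(n - 1)


def normal_cdf(t: float) -> float:
    """Area of the standard normal curve below t."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


r, n = -0.664, 17
sigma_r = standard_error_of_r_if_uncorrelated(n)
ratio = r / sigma_r
tail = normal_cdf(ratio)  # chance of a coefficient this far below zero
print(f"sigma_r = {sigma_r:.2f}, r/sigma_r = {ratio:.3f}, tail = {tail:.4f}")
```

The tail area comes to about four-tenths of 1 per cent, and the area between that point and zero to about 49.6 per cent.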

The last statement must not be interpreted to mean that there 
was really no correlation in the universe. In fact, it is much more 
likely that there was negative correlation in the universe than 
that there was not. We do not say that correlation did not 
exist, but merely that we have not studied enough cases to be 
sure of it yet. Seventeen cases are not enough on which to 
make a decision. For this reason we say that the correlation in 
our sample was not significant, and we either let the whole matter 
drop or proceed to study more cases to discover whether or not 
the relationship continues. 

We can learn much more about the probable value of the 
relationship in the universe by carrying through the z-transformation 
described on pages 452f. We find (by interpolation 
in Appendix IV) that when r = -0.664, z = -0.7999. The 
standard error of z in this case (with n = 17) is 0.2673. We 
should expect to find the true z value of the universe within $3\sigma_z$ 
of the z value of the sample, or between -0.7999 + 0.8019 and 
-0.7999 - 0.8019. Thus we have z limits of -1.6018 and 
+0.0020, corresponding to r values of -0.922 and +0.002. 
From these figures it is again evident that our sample is alto¬ 
gether too small to tell us with any degree of certainty what is 
true of the universe. Although our sample yielded a negative 
coefficient of correlation, we discover that it is quite possible 
for the coefficient for the universe to be practically zero. 
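
Without the table in Appendix IV, the same limits can be obtained from the inverse hyperbolic tangent, which is the z-transformation. This is a minimal sketch; it takes the standard error of z as $1/\sqrt{n - 3}$, which agrees with the figure of 0.2673 quoted for n = 17, and the function name `z_limits` is an assumption.

```python
import math


def z_limits(r: float, n: int, k: float = 3.0):
    """Limits for the universe correlation, k standard errors either side of z."""
    z = math.atanh(r)                  # the z-transformation of r
    sigma_z = 1.0 / math.sqrt(n - 3)   # standard error of z
    z_low, z_high = z - k * sigma_z, z + k * sigma_z
    # Convert the z limits back to the scale of r.
    return z, sigma_z, math.tanh(z_low), math.tanh(z_high)


z, sigma_z, r_low, r_high = z_limits(-0.664, 17)
print(f"z = {z:.4f}, sigma_z = {sigma_z:.4f}")
print(f"r limits: {r_low:+.3f} to {r_high:+.3f}")
```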

It is important for us to distinguish between two adjectives 
which are commonly used to describe coefficients of correlation. 
Sometimes we say that a coefficient of correlation is high, or that 
there is a high correlation. Again we may say that a coefficient 
is significant, or that there is a significant correlation. A high 
correlation is one in which the absolute size of the coefficient 
(that is, disregarding the sign) is close to 1.00. The term would 
not usually be applied until r exceeded about 0.85 or 0.90, 
although there is no general rule with regard to its use. A 
significant correlation is one which, when taken in conjunction 
with the corresponding value of z, indicates with reasonable 
certainty that the direction of the correlation in the universe 
(positive or negative) is the same as the direction of the corre¬ 
lation in the sample. If we accept three standard errors as 
the limit of fluctuation which can be accepted as being the 
result of pure chance, we can restate this by saying that one 
converts any value of r to the equivalent value of z. He then 
computes the standard error of z, and if z exceeds three times 
its standard error the correlation is said to be significant. In 
these cases where z exceeds three times its standard error, it is 
obvious that the sign of the coefficient of correlation in the uni¬ 
verse must be the same as the sign of the coefficient in the sample. 
Let us suppose that r is very small—say 0.090. In this case the 
value of z is 0.0902. If the sample contained 5000 cases, the 
standard error of z was 0.014, and z was more than three times 
its standard error. We can feel sure in this case that there was 
correlation in the universe, and that it was positive—but we can 
also feel sure that any linear correlation in the universe was too 
small to be of any practical value when we wish to estimate 
values of Y from values of X. The correlation was low, but 
significant, in this case. 1 
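
The contrast between a significant and a high correlation can be seen by computing both measures for this example. The sketch below assumes the standard error of z is $1/\sqrt{n - 3}$, as in the earlier computations.

```python
import math

r, n = 0.090, 5000
z = math.atanh(r)                      # z-transformation of r
sigma_z = 1.0 / math.sqrt(n - 3)       # standard error of z
error_ratio = math.sqrt(1.0 - r * r)   # S_y / sigma_y

print(f"z / sigma_z   = {z / sigma_z:.1f}")   # well above 3, so significant
print(f"S_y / sigma_y = {error_ratio:.3f}")   # close to 1, so of little use in estimating
```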

1 It is still a very common practice to give the value of r in conjunction 
with the value of its standard error or its probable error. It was once the 
accepted rule to consider a coefficient of correlation significant if it exceeded 
three times its standard error or 4.5 times its probable error. Setting any 
border line between significance and insignificance is arbitrary, and the 
particular point selected may be questioned. But the use of the standard 
error or the probable error at all in connection with coefficients of correlation 
should be discouraged now that we know that values of the coefficient found 
when repeated samples are taken from the same universe are not distributed 
normally. Especially when n is small or r is large, the student must learn to 
distrust the old measures and to rely on the z-transformation. 

15.11. Correlation of Grouped Data.—We have seen that it is 
possible, by making certain assumptions relative to the distribution 
of the items within the classes, to compute averages and 
measures of dispersion from data which have been classified in 
frequency tables. It is also possible to compute the value of r (or 
at least to approximate it) from data which have been classified 
in a two-way frequency table. Such tables are usually called 
correlation tables. We can illustrate with a table which shows 
the relationship between the length of the femur (upper bone of 
the hind leg) and the length of the humerus (upper bone of the 
front leg) in 370 rabbits. One would rather expect from obser¬ 
vation that rabbits with long front legs would tend to have long 
hind legs, and that those whose front legs were short would tend 
to be underslung behind also. If this is correct we should get a 
positive correlation. The data appear in Table 15.5. 

Table 15.5.—Numbers of Rabbits with Various Combinations of 
Femur Length and Humerus Length 

Humerus                    Femur length (millimeters)
length
(mm.)       76-77  78-79  80-81  82-83  84-85  86-87  88-89  90-91  92-93  94-95  Totals
78-79           .      .      .      .      .      .      .      .      .      1       1
76-77           .      .      .      .      .      .      .      .      .      .       0
74-75           .      .      .      .      .      .      .      .      2      .       2
72-73           .      .      .      1      .      1      4      1      3      .      10
70-71           .      .      .      .      3     13     13      4      .      .      33
68-69           .      .      1     10     29     29      4      .      .      .      73
66-67           .      .     13     52     47      4      .      .      .      .     116
64-65           .      9     51     32      4      .      .      .      .      .      96
62-63           2     16     13      4      .      .      .      .      .      .      35
60-61           1      2      1      .      .      .      .      .      .      .       4
Totals          3     27     79     99     83     47     21      5      5      1     370

(A dot marks an empty cell.) 

Each figure in the table shows the number of rabbits whose femur 
length was within the range given at the top of the column and 
whose humerus length was within the range given at the left of 
the row. That is, all the rabbits listed in a given column had the 
same (approximately) femur length, but as we pass down the 
column we come to shorter and shorter humerus lengths. In any 
row all rabbits have about the same humerus length, but as we 
pass from left to right we pass to cases of greater and greater 
length of femur. 

It will be seen that the distribution of entries in the table 1 is 
similar to the distribution of dots on a scatter diagram. 

1 The figures are taken from Castle, “Genetics and Eugenics,” p. 68, 
Harvard University Press, Cambridge, Mass., 1916. 

In this case the entries fall in a band which rises as we move toward the 
right. It is immediately evident that long femurs are associated 
with long humeri. It still remains for us to compute the value 
of r, the value of $S_y$, and the regression equation. These values 
could, of course, be found by assuming that each rabbit was 
at the mid-point of his class, extracting the data from the table, 
and computing r in the usual way. Thus, we could list for one 
rabbit X = 94.5 and Y = 78.5. (This rabbit would be the one 
entered at the extreme upper right of the table.) Similarly we 
could extract the figures for each other rabbit, duplicating the 
figures as many times as there were rabbits in each class. There 
is, however, a shorter way of going at the problem which gives 
the same answers. It involves taking arbitrary origins of classes 
and taking deviations in class-interval units, as was done when 
we computed the mean and the standard deviation by the short 
methods. 

We start with the data arranged in a correlation table in the 
form we have just seen. We total each column and each row. 
The column totals give us a frequency distribution of the X 
values just as though we had them arranged in a frequency 
table, and the totals of the rows give us a frequency distribution 
of the Y values. We take an arbitrary origin at the center of 
one of the class intervals (near the center of the X distribution 
and again near the center of the Y distribution) and then count 
off the class deviations in each direction as we did when finding $\sigma$ 
by the short method (see pages 141f.). If we label the totals 
of the columns $f_x$ (meaning frequencies of the X classes) and the 
totals of the rows $f_y$ (meaning the frequency with which various 
values of Y occurred), and if we call the class deviations of the 
X’s $d_x$ and the class deviations of the Y’s $d_y$, we can proceed to 
the major part of the computation immediately. 

We wish to compute the values which are necessary in the 
derivation of the two standard deviations. These values are 
$\Sigma f_x$ (or $\Sigma f_y$, since these two values are equal and each equals N), 
$\Sigma(f_x d_x)$, $\Sigma(f_x d_x^2)$, $\Sigma(f_y d_y)$, and $\Sigma(f_y d_y^2)$. We can proceed to 
compute these one at a time, getting our original values of $f_x$ 
and $f_y$ from the totals of the columns and rows, respectively, and 
taking arbitrary origins. The computations are given in Table 
15.6. 

Table 15.6.—Computation of Coefficient of Simple Linear 
Correlation from Grouped Data, Based on Table 15.5 

  f_x   d_x  f_x d_x f_x d_x^2      f_y   d_y  f_y d_y f_y d_y^2
    1    +5       +5        25        1    +6       +6        36
    5    +4      +20        80        0    +5        0         0
    5    +3      +15        45        2    +4       +8        32
   21    +2      +42        84       10    +3      +30        90
   47    +1      +47        47       33    +2      +66       132
   83     0        0         0       73    +1      +73        73
   99    -1      -99        99      116     0        0         0
   79    -2     -158       316       96    -1      -96        96
   27    -3      -81       243       35    -2      -70       140
    3    -4      -12        48        4    -3      -12        36
  370           -221       987      370             +5       635

The values of $f_y$ and $f_x$ come from the table on page 475. To 
check our arithmetic, we note that $\Sigma f_x$ always equals $\Sigma f_y$. We 
can now go ahead with the computations of the means and the 
standard deviations in accordance with the formulas on pages 
90 and 143. The results in this case are 

$$\bar{X} = 84.5 + 2\left(\frac{-221}{370}\right) = 83.306 \text{ mm.}$$

The value of the guessed mean was the class mark of the class 
containing 83 cases. This class mark was 84.5, which appears 
in the above computation as the guessed mean. In the case 
of the Y’s, we guessed the mean at the class mark of the class 
containing 116 cases. This class mark is 66.5, which appears 
in the computation of the average of the Y’s: 

$$\bar{Y} = 66.5 + 2\left(\frac{+5}{370}\right) = 66.527 \text{ mm.}$$

The two standard deviations are 

$$\sigma_x = 1.74 \qquad \sigma_y = 1.31$$

These standard deviations are both in units of the class interval 
(which is 2 mm.) and would have to be multiplied by this class 
interval if they were to be given in millimeters. For our pur¬ 
poses, however, they can be used as they are in class-interval 
units, since we merely wish to compare them with other figures 
which are also in these units. 

It is still necessary for us to compute something equivalent 
to the $\Sigma(XY)$ column of our previous method; that is, we need 
some total of the cross-products found by multiplying together 
the various X and Y values. We obtain it by multiplying each 
entry in the original table on page 475 by its $d_x$ value and by its 
$d_y$ value, keeping track of signs. If we rewrite the original table, 
using the class deviations instead of the original class limits, 
we get Table 15.7. 

Table 15.7.—Computation of Coefficient of Simple Linear 
Correlation from Grouped Data, Based on Table 15.5 

We now multiply each entry in the body of 
the table by the deviation in the column at the extreme left and 
by the deviation in the row at the extreme top. For example, in 
the upper right-hand corner is the figure 1. This would be multiplied 
by the figure 5 at the top and by the figure 6 at the left 
to give 1 × 5 × 6 = 30. The highest figure in the next column 
to the left is 2. This will be multiplied by the figure 4 at the 
head of the column and the figure 4 at the left of the row to give 
2 × 4 × 4 = 32. Some of the results may be negative. For 
example, the highest number in the fourth column from the left is 
under the heading -1 and at the right of the heading 3. Hence 
we get, since the number itself is 1, the following: 

$$1 \times (-1) \times 3 = -3$$

We carry out such a multiplication for each entry in the table, 
each time multiplying together three numbers (the number 
entered in the body of the table, the number at the head of the 
column in which it is entered, and the number at the left of the 
row in which it is entered). We add the products so obtained, 
keeping track of the signs. In the case of the present table, we 
then have 

$$\Sigma(f d_x d_y) = 627$$
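
The triple multiplication described above is tedious by hand but short in code. The sketch below transcribes the cell frequencies of Table 15.5 and takes the arbitrary origins at the 84-85 femur class and the 66-67 humerus class, as in the text. The variable names are assumptions made for illustration.

```python
# Cell frequencies of Table 15.5. Rows run from humerus class 78-79 down to
# 60-61; columns run from femur class 76-77 up to 94-95.
TABLE = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],      # 78-79
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],      # 76-77
    [0, 0, 0, 0, 0, 0, 0, 0, 2, 0],      # 74-75
    [0, 0, 0, 1, 0, 1, 4, 1, 3, 0],      # 72-73
    [0, 0, 0, 0, 3, 13, 13, 4, 0, 0],    # 70-71
    [0, 0, 1, 10, 29, 29, 4, 0, 0, 0],   # 68-69
    [0, 0, 13, 52, 47, 4, 0, 0, 0, 0],   # 66-67
    [0, 9, 51, 32, 4, 0, 0, 0, 0, 0],    # 64-65
    [2, 16, 13, 4, 0, 0, 0, 0, 0, 0],    # 62-63
    [1, 2, 1, 0, 0, 0, 0, 0, 0, 0],      # 60-61
]
D_X = list(range(-4, 6))       # class deviations of the columns: -4 ... +5
D_Y = list(range(6, -4, -1))   # class deviations of the rows:    +6 ... -3

n = sum(sum(row) for row in TABLE)
f_x = [sum(row[j] for row in TABLE) for j in range(10)]   # column totals
f_y = [sum(row) for row in TABLE]                         # row totals

sum_fdx = sum(f * d for f, d in zip(f_x, D_X))
sum_fdy = sum(f * d for f, d in zip(f_y, D_Y))
# Each entry times the deviation of its column times the deviation of its row.
sum_fdxdy = sum(
    TABLE[i][j] * D_X[j] * D_Y[i] for i in range(10) for j in range(10)
)

mean_x = 84.5 + 2 * sum_fdx / n   # guessed mean plus correction, class interval 2 mm.
mean_y = 66.5 + 2 * sum_fdy / n

print(n, sum_fdx, sum_fdy, sum_fdxdy)
print(f"mean femur = {mean_x:.3f} mm., mean humerus = {mean_y:.3f} mm.")
```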

We now solve for r by the formula 

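$$r = \frac{\dfrac{\Sigma(f d_x d_y)}{N} - \left(\dfrac{\Sigma(f_x d_x)}{N}\right)\left(\dfrac{\Sigma(f_y d_y)}{N}\right)}{(\sigma_x)(\sigma_y)}$$

The values of $\sigma_x$ and $\sigma_y$ which are substituted in this formula must 
be in units of the class interval. In the present case, this formula gives 

$$r = \frac{\dfrac{627}{370} - \left(\dfrac{-221}{370}\right)\left(\dfrac{5}{370}\right)}{(1.74)(1.31)} = \frac{1.703}{2.279} = +0.747$$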


The corrected value can be found by the usual method, although 
when there are as many as 370 cases the correction is of little 
importance. Using the correction formula on page 450, we get 

$$r'^2 = 1 - (1 - 0.747^2)\left(\frac{369}{368}\right) = 0.557$$

$$r' = \sqrt{0.557} = +0.746$$

The corrected value is, of course, almost identical with the uncor¬ 
rected value. When one is working with a large number of cases, 
as in the present problem, it never pays to use this correction. 
We find S y from the usual formula: 

$$S_y = \sigma_y\sqrt{1 - r^2}$$

We found that $\sigma_y = 1.31$ class intervals (see page 477). But the 
class interval is 2 mm., so we have to multiply by this figure to 
get the standard deviation in the original units: 

$$\sigma_y = 1.31(2) = 2.62 \text{ mm.}$$

$$S_y = (2.62)(0.666) = 1.74 \text{ mm.}$$

We expect, then, to be able to estimate humerus length with an 
error of 1.74 mm. or less in two-thirds of the cases, and we should 
never expect to make an error of over 3(1.74) = 5.22 mm. 

We can compute the regression equation from the first formula 
on page 457. Our two standard deviations (found by multiplying 
the figures on page 477 by the class interval of 2) are $\sigma_x = 3.48$ mm. 
and $\sigma_y = 2.62$ mm. If we substitute these values in the equation, 
we get 

$$y = +0.746\left(\frac{2.62}{3.48}\right)x$$

$$y = 0.561x$$

But x and y are deviations from the respective averages; that is, 
$x = X - \bar{X}$, and $y = Y - \bar{Y}$. If we substitute these other 
values in the equation above, we have 

$$Y - \bar{Y} = 0.561(X - \bar{X})$$

But we know the values of $\bar{Y}$ and $\bar{X}$ from page 477. Substituting 
their values in the equation, we get 

$$Y - 66.53 = 0.561(X - 83.3)$$

$$Y = 19.8 + 0.561X$$

This is our regression equation in its simplest form. If we wish 
to estimate the humerus length of a rabbit with a femur 90 mm. 
long, we find 

$$Y = 19.8 + 0.561(90) = 70.3 \text{ mm.}$$
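
The passage from r and the two standard deviations to the regression equation can be sketched as follows, using the figures quoted in the text. The function name `regression_line` is hypothetical.

```python
def regression_line(r, sd_x, sd_y, mean_x, mean_y):
    """Return (a, b) in Y = a + bX, where b = r * sd_y / sd_x."""
    b = r * sd_y / sd_x
    a = mean_y - b * mean_x
    return a, b


# Figures quoted in the text for the rabbit data (millimeters).
a, b = regression_line(r=0.746, sd_x=3.48, sd_y=2.62, mean_x=83.3, mean_y=66.53)
estimate = a + b * 90   # humerus length estimated for a 90 mm. femur

print(f"a = {a:.1f}, b = {b:.3f}, estimate = {estimate:.1f} mm.")
```

Because b is carried here to more decimal places than in the hand computation, the value of a may differ from the printed 19.8 in the first decimal place; the estimate for a 90 mm. femur is unaffected at the accuracy shown.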

In this case the value of a in the type equation has no sensible 
interpretation when taken alone; that is, it makes no sense to say 
that rabbits whose femur length is zero will tend to have a 
humerus length of 19.8 mm. In such cases, where the usual 
interpretation makes no sense, one thinks of the value of a merely 
as a necessary auxiliary to be used in determining the values of 
the dependent variable, and one does not try to interpret it 
separately. 

The value of b is 0.561. This tells us that there tends to be 
an increase of 0.561 mm. in humerus length for each 1 mm. 
increase in femur length; that is, b always tells us the change in 
the dependent variable which accompanies an increase of one 
unit in the independent variable. On page 463 we found that the 
regression equation for estimating corrected potato prices from 
the per-capita production was 

$$Y = 130.36 - 19.25X$$

Here again it makes little sense to say that if there were no pro¬ 
duction the price would be 130.36 cents. But we must always 
interpret the value of b, which is —19.25. This means that there 
tends to be a reduction of 19.25 cents in the corrected prices for 
each increase of one bushel in per-capita output. 

Finally, we should like to find the reliability of our coefficient 
of correlation. When r = 0.746 we find that z = 0.964. Since 
n = 370, we find that $\sigma_z = 1/\sqrt{367} = 0.0522$. We should 
expect in two-thirds of the cases to find z values between 0.964 + 
0.052 and 0.964 - 0.052, or between 1.016 and 0.912. These 
correspond to r values of 0.769 and 0.723. We should almost 
never expect to find the z value of any sample farther than $3\sigma_z$ 
from that of the universe. In this case we can say that the practical 
limits to the z values are 0.964 + 0.157 and 0.964 - 0.157, 
or 1.121 and 0.807. These figures set at 0.808 and 0.668 the limits 
beyond which we feel certain that the value of r in the universe 
does not lie. 

15.12. Simple Curvilinear Correlation.—All our computations 
up to this point in this chapter have been based on the assumption 
of linear regression. The symbol r is always used to designate 
the relationship around the least-squares straight line—to indi¬ 
cate the extent to which this line is helpful in estimating values of 
Y from values of X. It is possible to have a close relationship 
between X and Y, yet to have that relationship be markedly 
curvilinear, so that no straight line can be very helpful in estimating 
Y. In such a case r will be low. As we have said before, 
a low value of r does not indicate that there is little or no relation¬ 
ship, but that there is little or no linear relationship. 

It is just as simple and logical, as we discovered in Chap. XII, 
to fit curves as to fit straight lines. And after we have fitted a 
curve, it is just as simple and logical to find the degree of relation¬ 
ship around the curve as around a straight line. For example, in 
Sec. 12.8 we fitted a reciprocal curve to a scatter diagram showing 
the relationship between potato production and prices (see Fig. 
12.13). The formula for this curve was 

$$\frac{1}{Y} = -0.445 + 0.346X$$

This is our curvilinear regression equation. Next we should find 
something to correspond to our standard error of estimate and our 
coefficient of correlation. The former is easy to ascertain, since 

Table 15.8.— Determination of Errors of Estimate from Reciprocal 
Regression Line 


Year    Production    Estimated price    Actual price    Difference
1906       2.7             2.05              1.6           -0.45
1907       2.6             2.21              2.0           -0.21
1908       2.3             2.86              2.6           -0.26
1909       3.2             1.51              1.2           -0.31
1910       2.7             2.05              2.0           -0.05
1911       2.5             2.39              3.1           +0.71
1912       3.5             1.31              1.3           -0.01
1913       2.7             2.05              2.1           +0.05
1914       3.5             1.31              1.4           +0.09
1915       2.8             1.91              2.0           +0.09
1916       2.2             3.17              3.9           +0.73
1917       3.2             1.51              1.9           +0.39
1918       3.0             1.69              1.8           +0.11

we know that it is merely the standard deviation of our errors of
estimating by use of the regression equation. We have already
seen that our curvilinear regression equation may be used for
estimating values of Y from known values of X. Let us estimate
the price which would be expected with each of the actual productions
listed in Table 12.7 on page 368, and compare these estimated
prices with the actual ones. The comparison is made in
Table 15.8. Here each figure in the column of estimated prices
is computed by substituting the year's production figure for X in
the reciprocal regression equation, and solving. The differences
in the last column are found by subtracting the estimated from
the actual values. These differences are the residuals around the
reciprocal regression line, that is, the vertical distances of the points in
Fig. 12.13 from the line. The standard deviation of these differences
or errors is the standard error of estimate, as we learned on
page 445. Computing the standard deviation of the figures in
the last column of Table 15.8 tells us that


S_y = 0.11
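
The estimated prices and differences of Table 15.8 can be reproduced with a short Python sketch. It assumes the reciprocal regression equation has the form 1/Y = -0.445 + 0.346X and copies the production and price figures from the table; the variable and function names are illustrative only. Because the two parameters are rounded to three places, an estimate may differ from the tabled figure by about a cent.

```python
# Production (X) and actual price (Y) for 1906-1918, copied from Table 15.8.
years = list(range(1906, 1919))
production = [2.7, 2.6, 2.3, 3.2, 2.7, 2.5, 3.5, 2.7, 3.5, 2.8, 2.2, 3.2, 3.0]
actual = [1.6, 2.0, 2.6, 1.2, 2.0, 3.1, 1.3, 2.1, 1.4, 2.0, 3.9, 1.9, 1.8]

# Rounded parameters of the reciprocal regression equation, 1/Y = A + B*X.
A, B = -0.445, 0.346

def estimate_price(x):
    """Estimate the price for production x by solving 1/Y = A + B*x for Y."""
    return 1.0 / (A + B * x)

estimated = [estimate_price(x) for x in production]

# Each difference is the actual price minus the estimated price.
residuals = [y - e for y, e in zip(actual, estimated)]

for year, x, e, y, d in zip(years, production, estimated, actual, residuals):
    print(f"{year}  {x:4.1f}  {e:5.2f}  {y:4.1f}  {d:+6.2f}")
```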


To distinguish a coefficient of simple curvilinear correlation 
from a coefficient of simple linear correlation we use for the 
former the symbol p instead of the symbol r, which is used with 
linear regression lines. When the statistician sees the symbol p 
he knows that estimates of Y have been made by means of a 
curve. The general formula, however, is one with which we are 
already familiar, and the methods repeat those of simple linear 
correlation. We next compute the standard deviation of the 
original prices. This turns out to be 0.7211. We need only 
substitute these two values in the equation on page 447 to dis¬ 
cover that 


$$p = \sqrt{1 - \frac{0.11^2}{0.7211^2}} = 0.988$$
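
As a check on the arithmetic, the substitution can be carried out in Python. The sketch assumes the formula is p = sqrt(1 - S_y^2 / sigma_y^2), which is the form the substitution above implies; the function name is hypothetical.

```python
import math

def index_of_correlation(std_error_estimate, std_dev_y):
    """Return sqrt(1 - S_y^2 / sigma_y^2) for a fitted line or curve."""
    ratio = (std_error_estimate / std_dev_y) ** 2
    if ratio > 1.0:
        raise ValueError("standard error of estimate exceeds the standard deviation")
    return math.sqrt(1.0 - ratio)

# Values stated in the text for the reciprocal curve.
p_reciprocal = index_of_correlation(0.11, 0.7211)
print(round(p_reciprocal, 3))
```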


It is well to point out in connection with this illustration the 
fact that the residuals as shown in the table are peculiarly 
arranged. In the early years the residuals tend to be negative 
and in the later years positive. Such a condition would lead 
one to believe that it would be advantageous to eliminate 
trends before correlating. 

The scatter diagram has shown us that the reciprocal curve 
is a better fit than is the straight line. The fact that the curve 
describes the relationship better than does the line can be seen 
also by comparing the index and the coefficient of correlation. 
Using the data from Table 12.7, we find that r = -0.822. When
r and p are both computed from the same raw data, the latter 
will always either equal or exceed the former in size. If the 
data are actually distributed in a straight band (that is, if they 
are really linear), the two coefficients will be equal. If the data 
are curvilinear, the index of correlation will always be greater 
than the coefficient of correlation. This is to be expected, since 
in this case the curve will be a better basis of estimates than the 
straight line. 

If we were to compute p from the second-degree parabola,
we should follow the same steps we have just taken. First,
we should estimate each year's price from the production by
means of the regression equation. Second, we should find the
differences between the actual prices and the estimated prices.
Third, we should find the standard deviation of these differences.
This would be the standard error of estimate, and in the case of
the parabola it equals 0.338. We should substitute this value,
along with the standard deviation of the Y's, in the equation
for p and then solve the equation. In this case our answer is
0.884. Thus, the parabola gives us a better basis for estimate
than does the straight line, but not so good a basis as the reciprocal
curve.

15.13. Corrections for Degrees of Freedom.—If we have but
two points on our scatter diagram, the best fitting straight line
will pass through them both and we shall have perfect correlation.
In other words, if we try to correlate two observations, we are
bound to get perfect correlation. Similarly, if we have but three
points the best fitting second-degree parabola will pass through
them all with no error of estimate, and p will equal one. Likewise,
the best fitting third-degree parabola will pass through four
points; and so on. In our most recent illustration we had 13
cases. Had we wished to fit a 12th-degree parabola, it would
have passed through all the points and we should have found

p = 1.00.
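
The point can be illustrated with a small Python sketch. The four points below are hypothetical, chosen only for illustration, and have distinct X values (an assumption the argument needs: two cases with the same X but different Y cannot both lie on one curve). The sketch evaluates the third-degree parabola through the four points by Lagrange interpolation, which is used here simply as a convenient way of writing that parabola, and shows that every residual vanishes, so that p = 1.

```python
import math

def lagrange_estimate(xs, ys, x):
    """Evaluate at x the polynomial of degree len(xs) - 1 through all the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data: four cases, so a third-degree parabola fits them exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.5, 0.7, 3.1, 1.9]

residuals = [y - lagrange_estimate(xs, ys, x) for x, y in zip(xs, ys)]

# Standard error of estimate and standard deviation of the Y's (divisor N).
n = len(ys)
mean_y = sum(ys) / n
s_y = math.sqrt(sum(d * d for d in residuals) / n)
sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

p = math.sqrt(1.0 - (s_y / sigma_y) ** 2)
print(residuals, p)
```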

It is evident, then, that the results of correlation analysis give 
a somewhat exaggerated picture of the degree of the relationship, 
and that the extent of the exaggeration depends on two things: 

1. The number of cases studied in the problem. 

2. The number of parameters in the equation.

Just as we corrected the standard error of estimate and the coefficient
of correlation when our relationship was linear, so must we
correct our results when we have used curvilinear methods. In
fact, the corrections are much more important in the latter case
because the number of parameters is larger. The formulas for
the corrected values (letting S'_y represent the corrected standard
error of estimate and p' represent the corrected index of correlation)
are

$$(S'_y)^2 = S_y^2 \left( \frac{N - 1}{N - P} \right)$$

$$(p')^2 = 1 - (1 - p^2) \left( \frac{N - 1}{N - P} \right)$$





In these two equations P represents the number of parameters in 
the regression equation if the equation is parabolic. For freehand 
curves or other nonparabolic curves one uses for P the number of 
parameters which it would be necessary to use in a parabolic 
equation to give as many twists as there are in the curve which 
is used. For example, the reciprocal curve has one bend (see 
Fig. 12.13, page 369). The second-degree parabola, which is 
described by an equation with three parameters (a, b, and c, as on
page 344), also has one bend. Thus if we have used the reciprocal 
curve (or a freehand curve with one bend) we let P = 3. If one 
computes p from a curve with two bends, one lets P = 4, etc. 
In general the value of P will be greater by two than the number 
of bends in the curve. 

In the case of our reciprocal curve we found that S_y = 0.11 and
that p = 0.988 (page 483). Let us correct these values. The
reciprocal curve has one bend, and hence is equivalent to the
second-degree parabola Y = a + bX + cX^2. The equivalent
number of parameters is 3. Hence we have

$$(S'_y)^2 = 0.11^2 \left( \frac{12}{10} \right) = 0.01452$$
$$S'_y = 0.12$$

$$(p')^2 = 1 - (1 - 0.988^2) \left( \frac{12}{10} \right) = 0.9713728$$
$$p' = 0.986$$

On page 484 we gave the results of the parabolic curve as

S_y = 0.338
p = 0.884

These give the following corrected values:

S'_y = 0.370
p' = 0.858

On page 483 the linear correlation results are given as

r = -0.822

Hence we get the corrected value r' = 0.804. In this latter case
P = 2.
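
The corrections just worked out can be checked with a brief Python sketch. It assumes that the correction factor is (N - 1)/(N - P), which is what the worked figures imply (0.11^2 multiplied by 12/10 gives 0.01452 when N = 13 and P = 3). The function names are hypothetical, and the treatment of a negative corrected square follows the rule that such a result is to be called zero.

```python
import math

def corrected_standard_error(s, n, n_params):
    """Corrected standard error of estimate: sqrt(S^2 * (N - 1) / (N - P))."""
    if n <= n_params:
        raise ValueError("the number of cases must exceed the number of parameters")
    return math.sqrt(s ** 2 * (n - 1) / (n - n_params))

def corrected_index(rho, n, n_params):
    """Corrected index (or coefficient) of correlation; negative squares count as zero."""
    if n <= n_params:
        raise ValueError("the number of cases must exceed the number of parameters")
    squared = 1.0 - (1.0 - rho ** 2) * (n - 1) / (n - n_params)
    return math.sqrt(squared) if squared > 0 else 0.0

N = 13  # cases in the potato study
print("reciprocal:", round(corrected_standard_error(0.11, N, 3), 2),
      round(corrected_index(0.988, N, 3), 3))
print("parabola:  ", round(corrected_standard_error(0.338, N, 3), 3),
      round(corrected_index(0.884, N, 3), 3))
print("line:      ", round(corrected_index(0.822, N, 2), 3))
```

The parabola's corrected index may differ from the text's figure in the last decimal place, since p = 0.884 is itself a rounded value.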

We can now compare the results of the various methods which
we have applied to the potato study. We find with the reciprocal
curve that p' = 0.986. With the second-degree parabola
p' = 0.858. With the straight line r' = 0.804. The reciprocal
curve is the best to use for purposes of estimating the price, and
the straight line is the worst of those tried; but even the straight
line is better than nothing. With the reciprocal curve S'_y = 0.12,
which means that two-thirds of our estimates should fall within 12
cents of the actual price and that we should practically never be in
error by over 36 cents. With the parabolic curve S'_y = 0.37; hence
two-thirds of our estimates should be within 37 cents of the actual
price and we should practically never be in error by over $1.11.
The standard error around the straight line is 0.429. This means
that, if we use the straight line as the basis of our estimates, we
should be within 43 cents of the correct price two-thirds of the
time and we should almost never err by over $1.29.

It is necessary to add one word of explanation to the formula
for correcting p. If the value of

$$(1 - p^2) \left( \frac{N - 1}{N - P} \right)$$

should turn out to be greater than 1, so that the value of (p')^2 is
negative, the corrected result should be called zero; that is, there
is no evidence of relationship.

We noted a moment ago that if the number of cases and the 
number of parameters are equal, the line will pass through all the 
points and the uncorrected index of correlation will equal unity. 
It will be mathematically necessary for us to get perfect correla¬ 
tion. But note what will happen when we correct the index of 
correlation. If the number of cases and the number of parameters 
are equal, N — P = 0. Thus we have a value of zero for the 
denominator of the last part of the correction formula. But 
division by zero is not allowed in algebra; the idea is absurd. 
Likewise the idea of fitting a complicated curve to a small number 
of points is absurd. 

15.14. Linear and Curvilinear Correlation Compared.—It will 

be noted that the formula we have used for p is exactly the same 
as the basic formula that we first derived for r on page 447; that 
is, the concepts are the same. The coefficient of correlation tells 
us whether or not it is advantageous to use a straight line in esti¬ 
mating values of the dependent variable. The index of correla¬ 
tion tells us whether or not it is advantageous to use some particu¬ 
lar curve in estimating values of the dependent variable. 

We have, however, made a slight departure from our former
practice in fitting the reciprocal curve, and we should make a
similar departure if we fitted a logarithmic curve. In fitting the
reciprocal curve we found values of the parameters in such a way
that we minimized the sum of the squared deviations. But these
were not, as before, deviations of the Y's from the average Y.
This time they were deviations of the reciprocals of Y from the
average reciprocal. Thus we minimized the sum of the squared
deviations of the reciprocals from the mean reciprocal. In fitting
a logarithmic curve (which we should do just as we fitted the
reciprocal curve, save that we should use logs of Y rather than
reciprocals of Y), we should minimize the sum of the squared
deviations of the logs of Y from the mean log of Y. In such cases
one must admit that the least-squares criterion of goodness of fit
is no longer being used, since we are not minimizing the sum of the
squared deviations in the sense which we originally described.
In other words, while there is a theoretical advantage in using the
least-squares approach when fitting straight lines and parabolas,
this advantage disappears when we fit reciprocal curves, logarithmic
curves, and the like. In these cases the use of the normal
equations which are derived in such a way as to minimize the sum
of the squared deviations is arbitrary and based merely on the
fact that they are convenient. Theoretically one may as well use
the freehand regression line, and as a matter of fact this has the
advantage that the man using it is not so likely to be led astray by
misplaced reliance on mathematical computation as he is when
using the more formal methods (see last paragraph of Chap. I,
pages 5-6). One can, of course, determine the equation of his
freehand regression curve if he wishes by the methods already
described for finding the formula of freehand trend lines (see
Chap. XII).
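
What it means to apply least squares to the reciprocals rather than to the Y's themselves can be seen in a short Python sketch. It fits a straight line to 1/Y by the usual normal equations, using the production and price figures of Table 15.8, and then measures the squared deviations both in reciprocal units (the quantity actually minimized) and in price units (the quantity that is not). The function name is hypothetical; the fitted parameters should land close to the -0.445 and 0.346 quoted earlier.

```python
# Production (X) and actual price (Y) for 1906-1918, copied from Table 15.8.
production = [2.7, 2.6, 2.3, 3.2, 2.7, 2.5, 3.5, 2.7, 3.5, 2.8, 2.2, 3.2, 3.0]
price = [1.6, 2.0, 2.6, 1.2, 2.0, 3.1, 1.3, 2.1, 1.4, 2.0, 3.9, 1.9, 1.8]

def fit_line(xs, ys):
    """Solve the two normal equations of the straight line ys = a + b*xs."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Least squares is applied to the reciprocals of Y, not to Y itself.
reciprocals = [1.0 / y for y in price]
a, b = fit_line(production, reciprocals)

# Sum of squared deviations in the units that were minimized (reciprocals) ...
ss_reciprocal_units = sum((r - (a + b * x)) ** 2
                          for x, r in zip(production, reciprocals))
# ... and in the original price units, which the fit did not minimize.
ss_price_units = sum((y - 1.0 / (a + b * x)) ** 2
                     for x, y in zip(production, price))

print(round(a, 3), round(b, 3))
print(ss_reciprocal_units, ss_price_units)
```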

15.15. Standard Errors in Curvilinear Correlation.- We should 
naturally like to test the reliability of our index of correlation to 
find whether the index is apt to be indicative of the nature of the 
relationship in the universe or whether it may have arisen from 
peculiarities of the particular sample studied. For this purpose 
some authors have suggested the use of the formula 

$$\sigma_p = \frac{1 - p^2}{\sqrt{N - P}}$$


In the case of the reciprocal curve with which we have just
been working, we found that p = 0.988, N = 13, and P = 3.
From these figures and the above formula we should compute
σ_p = 0.00755. However, we saw on page 451 that these common
formulas are often misleading when applied to small samples or
to instances in which the relationship is large. The situation is
even worse in curvilinear than in linear correlation. The curvilinear
coefficients found in various samples from the same universe
are decidedly non-normal in distribution, and the use of
ordinary methods of standard or probable error is out of the question.
Hotelling has said, 1 “The probable error of the correlation
ratio may now be considered as an obsolete concept; the
assumption of a normal distribution is in this case an extremely
crude approximation.” It is possible that the reliability of the
curvilinear regression line could be determined by an extension
of analysis of variance, but these methods are too advanced for
us to take up here. The interested reader is referred to the standard
work on the subject by R. A. Fisher. 2 We are probably safe
in assuming at least that there is some correlation in the universe
when the value of p is as high as it is in our present case. 3
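
For completeness, the suggested formula can be evaluated in Python, assuming it has the form (1 - p^2) / sqrt(N - P). The function name is hypothetical, and the result is subject to all the reservations just stated about small samples and high correlations; the sketch checks the arithmetic only.

```python
import math

def std_error_of_index(rho, n, n_params):
    """Evaluate (1 - rho^2) / sqrt(N - P), the formula some authors suggest."""
    if n <= n_params:
        raise ValueError("the number of cases must exceed the number of parameters")
    return (1.0 - rho ** 2) / math.sqrt(n - n_params)

# Reciprocal curve of the potato study: p = 0.988, N = 13, P = 3.
sigma_p = std_error_of_index(0.988, 13, 3)
print(sigma_p)
```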

1 Journal of the American Statistical Association, Vol. XXVI, No. 173A,
p. 82.

2 “Statistical Methods for Research Workers,” Oliver & Boyd, Edinburgh
and London, 1938.

3 In this connection we should note that it is rather difficult to define our
universe in the present case. We have based our computations on figures
showing potato production and prices for the years 1906-1918. But it can
hardly be said that those are a random sample of the years before and the
years after. They were not chosen by lot from all the years in existence
(or from all the years that ever did or ever will exist). Only in so far as the
years 1906-1918 are representative of other years can we apply our conclusions
to other periods. This fact is but one of the difficulties which arise
when we correlate time series. Ordinarily we should also hesitate to correlate
time series without first eliminating trends and seasonal movements;
otherwise we are apt to get spurious correlation. If we correlate monthly
egg prices in New York City with monthly mean temperatures in Bangkok,
we shall almost certainly get a sizable coefficient of correlation merely
because both series are characterized by seasonal fluctuations, and they
are almost certain to be fluctuating either together or in opposite directions.
The dangers inherent in correlation problems involving historical data should
be evident. It is impossible to give more space to them here.

15.16. Suggestions for Further Reading.—Nowhere can the student find
a better source of information on correlation methods and procedures than
in Mordecai Ezekiel's “Methods of Correlation Analysis,” John Wiley &
Sons, Inc., New York, 1941. In Chaps. 20 and 21 of James G. Smith's
“Elementary Statistics,” Henry Holt and Company, Inc., New York, 1934,
the student will find an interesting treatment of correlation in which the
approach is very different from that used in this book. This difference in
treatment may be helpful to the student who needs another point of view
to throw correlation concepts into relief. Chapters 13 to 16 of M. G.
Kendall, “The Advanced Theory of Statistics,” Vol. I, J. B. Lippincott
Company, Philadelphia (Preface dated 1943) give an exhaustive treatment
of the whole problem of contingency and association and correlation,
although this excellent presentation is highly mathematical. G. U. Yule
and M. G. Kendall, in “An Introduction to the Theory of Statistics,”
Charles Griffin & Co., Ltd., London, 1937, Chaps. 11 to 16, give a clear
presentation which goes much farther than this introductory textbook has
done. Many authors prefer to introduce the concept of correlation as a
special case of analysis of variance. Perhaps best among these is George
W. Snedecor, “Statistical Methods Applied to Experiments in Agriculture
and Biology,” Iowa State College Press, Ames, Iowa, 1946, Chaps. 7 and
12 to 14.

EXERCISES 

1. Give two or three examples of relationship between variables which
is such that it can be stated by a mathematical formula, like the relationship
between the circumference and the diameter of a circle. Give two or three
examples of relationships that are not ordinarily so described.

2. Table 15.9 on page 490 shows the percentage of the population of each
state which filed income-tax returns in 1930 and the number of automobiles
registered per 100 population in 1930. The states are arranged geographically,
and are in the same order in which they appear in the Statistical
Abstract. 1

Divide the states into two groups, putting the first 24 states (Maine
through Virginia) into one group and the second 24 states (West Virginia
through California) in the second group. Compute r for the first group.
Use the per cent filing income-tax returns as the independent variable.
Compute the regression equation. How many cars per 100 people would
you expect to find in a state in which 4.5 per cent of the people filed
income-tax returns? Compute and explain S_y.

3. Using the z-transformation, compute the largest and smallest values
of r which you would expect to find in other samples from the universe of
Exercise 2. Then compute the value of r for the second group of states in
Exercise 2 to see what is actually true of the r of another sample. How can
you explain the results? Were not the samples both chosen from the same
universe, and at random as far as income taxes and automobile registrations
were concerned?

1 Figures are from Statistical Abstract , 1933. Figures on percentage of 
population filing tax returns are from p. 179, and those on automobiles per 
100 persons are derived from 1930 population figures on p. 9 and 1930 auto¬ 
mobile registration figures on p. 336. 
Table 15.9.— Percentage of Population Filing Income-tax Returns
and Number of Automobiles Registered per 100 Population, by
States, 1930

State             Percentage filing     Cars per 100
                  income-tax return     population
Maine                   2.24               23.4
New Hampshire           2.99               24.0
Vermont                 2.40               24.2
Massachusetts           4.76               19.8
Rhode Island            3.47               10.8
Connecticut             4.66               20 a
New York                6.66               18.2
New Jersey              4.65               21.0
Pennsylvania            3.35               18.2
Ohio                    3.00               26.4
Indiana                 2.03               27.0
Illinois                4.29               21.4
Michigan                3.04               27.2
Wisconsin               3.24               26.6
Minnesota               2.24               28.7
Iowa                    1.62               31.4
Missouri                2.36               21.0
North Dakota            1.21               26.8
South Dakota            1.36               29.6
Nebraska                1.98               30.9
Kansas                  1.74               31.6
Delaware                3.92               23.4
Maryland                4.19               19.7
Virginia                1.67               15.4
West Virginia           1.57               15.3
North Carolina          0.80               14.2
South Carolina          0.70               12.5
Georgia                 1.00               11.8
Florida                 1.92               26.1
Kentucky                1.19               12.6
Tennessee               1.25               14.0
Alabama                 0.85                9.6
Mississippi             0.60               11.8
Arkansas                0.67               11.9
Louisiana               1.57               13.0
Oklahoma                1.36               22.8
Texas                   1.80               23.3
Montana                 2.16               25.1
Idaho                   1.76               26.8
Wyoming                 3.02               27.4
Colorado                2.80               29.8
New Mexico              1.49               19.7
Arizona                 2.43               25.4
Utah                    2.32               22.4
Nevada                  4.40               33.0
Washington              1.69               28.5
Oregon                  1.18               28.6
California              4.78               35.7

4. In a certain Connecticut dairy region a study was made of farmers'
incomes. 1 It was discovered that there was a correlation between the number
of cows on a farm and the gross income of the farm of

r = +0.519 ± 0.0638

a. Of how much advantage was the regression equation in the reduction
of the error of estimate?

b. How many farms were studied?

c. What can you say about the value of r in the universe from which these
farms were drawn?

d. If a farm has 7 more cows than the average, and if the standard deviation
in number of cows on these farms is 3, how large a gross income would
you expect it to produce?

e. What was the standard error of the coefficient of correlation?

f. Assuming still that the standard deviation in number of cows is 3, what
is the standard error of estimate, S_y?

5. Suppose that a curvilinear relationship exists between two variables,
and that we compute the value of r. Will this coefficient overstate or
understate the degree of the relationship? Why?

6. Look up some data which you think should show relationship. Plot
them on a scatter diagram.

7. On page 456 is a list of directions for computing r in accordance with
formula (4) on page 455. Write out a similar list of directions for formula
(3) of the same page.

8. Solve the two normal equations (page 457) and show that

$$b = \frac{N \Sigma XY - \Sigma X \Sigma Y}{N \Sigma X^2 - (\Sigma X)^2}$$

$$a = \frac{\Sigma X^2 \Sigma Y - \Sigma X \Sigma XY}{N \Sigma X^2 - (\Sigma X)^2}$$

9. Show from the normal equations (page 457) that we can also determine
the values of a and b from the following formulas (M_x = mean of the
X's, etc.):

$$b = \frac{\Sigma(XY) - N M_x M_y}{\Sigma(X^2) - N (M_x)^2}$$

$$a = M_y - b M_x$$

10. Find in trigonometric tables the value of √(1 - r^2) when r = 0.758;
when r = 0.222. Check the results by longhand computation.

11. By definition Σy^2 = Σ(Y - Ȳ)^2. Show, therefore, that

$$\Sigma y^2 = \Sigma Y^2 - N \bar{Y}^2$$

as is stated on page 458.

12. Make a chart showing the relationship of S_y/σ_y and r, as suggested on
page 466. Check the problems of Exercise 10 above with the chart.

1 I. G. Davis and C. I. Hendrickson, Soil Type as a Factor in Farm
Economy, Storrs Agricultural Experiment Station Bulletin 139, p. 92.

13. Compute and explain the coefficient of determination for Exercise 4
above.

14. Groves and Ogburn studied social phenomena in 170 cities. They
determined for each city the sex ratio (number of males per 100 females) 
and the percentage of women 25 years of age and over who were married. 
Their results are given in Table 15.10 in which we have listed the number of 
cities falling in each class. 1 

Table 15.10.— Numbers of Cities with Various Combinations of Sex 
Ratio and Percentage of Women Married 


Rows: sex ratio. Columns: per cent of women married, in the classes
44 to 47, 48 to 51, 52 to 55, 56 to 59, 60 to 63, 64 to 67,
68 to 71, 72 to 75, 76 to 79, 80 to 83, 84 to 87, 88 to 91.


60- 68 
69- 77 

1 


2 










78- 86 


1 

1 

2 

2 

1 

1 






87- 95




5 

18 

17 


1 





96-104 




1 

5 

30 

18 

6 





105-113 






3 

6 

9 

1 




114-122 






1 

7 

10 





123-131 





1 


2 

7 

1 




132-140 








2 


1 



141-149 









2 

1 



150-158 

159-167 

168-176 









1 


1 


177-185 

186-194 











1 

1


Would inspection of the data lead you to believe that the relationship, if
any, is positive or negative? Note that the scale at the left is reversed, with
the small items at the top. Compute r, S_y, and the regression equation.

15. How many parameters would be necessary in a parabolic equation
to give eight bends to a curve?

16. Using the normal equations of the second- and third-degree parabolas
as guides, write the normal equations of the fourth-degree parabola. Students
of the calculus should derive the equations to check their results.

17. If, in a given problem, the number of parameters exceeds the number
of points on the scatter diagram, what happens to the correction formulas
on page 484? Explain.

1 Groves and Ogburn, “American Marriage and Family Relationships,”
p. 481, Henry Holt and Company, Inc., New York, 1929.





























CHAPTER XVI 

MULTIPLE CORRELATION 

16.1. Nature of Multiple Relationships. —We have discovered 
that it is often possible to make use of a knowledge of the value 
of one variable when we are trying to estimate the value of 
another. When the knowledge of one variable makes possible 
more accurate estimates of the value of another variable than 
would be possible without it, we have said that the two variables 
are related or correlated. Yet it is evidently true that when we 
are trying to make an estimate of the value of some one variable 
(say a person’s weight) we may wish to consider not one other 
variable but several others. For example, if you are asked to 
estimate the weight of an unknown person, what data will help 
you to make your estimate? You will want to know the person’s 
age, height, sex, nationality, etc. Obviously weight is correlated 
with several other things, not merely with one, and the regression 
equations which we have used up to this point have enabled us to 
estimate the value of one variable by substituting the value of but 
one other variable. How much handier it would be if we could 
find an equation in which we could substitute figures for both a 
man’s height and his age in making an estimate of his weight! 

Problems that involve the determination of the relationship
between one variable and several other variables acting together
are called problems of multiple correlation. The chances are good
that most relationships are in fact multiple relationships; most
effects probably have multiple causes. In some cases some one
of the connected variables is so far and away more important than
any of the others that we can neglect the others and determine the
relationship of the effect to but one of the causes. 1 For example,
we often estimate the period of vibration of a pendulum from the
length alone, neglecting variations caused by gravity, air resistance,
etc. The length is relatively so important under ordinary
circumstances that we feel safe in assuming that the period varies
with the length alone. Such a relationship would be one of simple
correlation. If we desired greater accuracy, however, and tried
to estimate the period of a pendulum from its length, the pull of
gravity, and the resistance of the surrounding medium, we should
have a problem in multiple correlation.

1 Here we fall into the easy circumlocution of “cause and effect” without
taking the trouble to explain their meaning. Again we mean merely that a
knowledge of one or more variables is helpful in estimating the value of
another variable. As to which variable is the cause and which the effect,
or what is the meaning of cause and effect, we leave the problem to texts
on logic and philosophy.

16.2. Dependent and Independent Variables.—In problems 
of multiple correlation, we are dealing with situations that involve 
three or more variables. We are trying to make estimates of the 
value of one of these variables based on the values of all the others. 
The variable whose value we are trying to estimate is called the 
dependent variable, and the other variables, on which our estimates
are based, are known as the independent variables. Again we
emphasize, as we did in the case of simple correlation (see page 
293) that there is no problem of cause and effect involved in 
the dependence or independence of variables. It is merely a 
question of the usefulness of one variable in making estimates of 
another. The variable that we wish to estimate is automatically 
the dependent variable. We select as independent variables all 
the other variables which, in our opinion, will be of significant 
help to us in estimating its value. 

It should be obvious that the statistician himself chooses which 
variable is to be dependent and which variables are to be inde¬ 
pendent. It is merely a question of the problem being studied. 
If we are trying to determine the most probable weight of men, 
we make weight the dependent variable and height, age, etc., the 
independent variables. If, on the other hand, we are interested 
in explaining (or estimating) height, we will make height the 
dependent variable and age, weight, etc., the independent 
variables. 

Problems of multiple correlation always involve three or more
variables (one dependent and two or more independent). In
order that we may distinguish them easily, we follow the custom
of representing them by the letter X with various subscripts.
The dependent variable is always denoted by X_1, and the others
by X_2, X_3, etc. Thus in the height, weight, and age problem
which we have just used for purposes of illustration, if we are
trying to estimate men's weights (that is, if weight is the dependent
variable) we might say:

X_1 = weight in pounds
X_2 = height in inches
X_3 = age in years

Any statistician who read these statements would know that
weight was the dependent variable, since its subscript is the
number 1.

Other correlation symbols are also changed where necessary.
In simple correlation we have designated the coefficient of correlation
by r and the index of correlation by p. The coefficient of
multiple linear correlation is represented by R, and it is common
to add subscripts designating the variables involved. Thus
R_1.234 would represent the coefficient of multiple linear correlation
between X_1 on the one hand and X_2, X_3, and X_4 on the other.
Likewise p_1.234 would represent the index of multiple correlation
(used when the relationships are curvilinear) between X_1 on the
one hand and X_2, X_3, and X_4 on the other. The subscript of the
dependent variable is always to the left of the point. The standard
error of estimate when the value of X_1 is computed from the
values of X_2, X_3, and X_4 would be represented by S_1.234. The
variable estimated is designated by the number to the left of the
point, and the variables on which estimates are based are to the
right of the point. The standard deviation of the X_1 variable
would be represented by σ_1; and σ_2 and σ_3 would represent the
standard deviations of X_2 and X_3, respectively. The arithmetic
means of X_1 and X_2 would be shown as X̄_1 and X̄_2 or as M_1 and
M_2, respectively.

16.3. Multiple-regression Equations.—The regression equa¬ 
tions which we used in simple linear correlation were rather simple 
algebraic equations which (after we had evaluated the parameters) 
had but two unknowns, X and Y. For example, such an equa¬ 
tion might be 

Y = 12 - 3X

This equation showed us how to vary our estimate of Y when the 
value of X varied. In fact, the above equation tells us to reduce 
the value of Y three units whenever we increase the value of X 
one unit. 
When we came to parabolas of higher degrees, the regression
equations became somewhat more complex because we included
in them one or more of the higher powers of X as well as the first
power. But still there were but two variables involved, X and Y.
For example, such an equation might be

Y = 7 - 2X + 5X^2

This equation also tells us exactly how to vary our estimates of Y 
as the value of X varies. 

The multiple-regression equation must obviously be altered
so that we can account for changes in all the independent variables.
The value of the dependent variable is to depend on the
values of several other variables. This fact, however, requires
no major alteration in the regression equation. It requires merely
that we add terms for the new variables. If we are to use two
independent variables, with the dependent variable being represented
by X_1 and the other two variables by X_2 and X_3, our equation
might be something like this:

X_1 = 3 + 2X_2 - 3X_3

Now if the value of X_2 is 10 and the value of X_3 is 7, we can substitute
to get

X_1 = 3 + 2(10) - 3(7) = 2

In this case we have estimated the value of X_1 from known values
of X_2 and X_3. The nature of the problem is exactly the same as
in simple correlation.

It will be noted that in the above sample multiple-regression
equation we are told that if $X_2 = 0$ and $X_3 = 0$, $X_1$ will equal 3.
(Try substituting 0 for $X_2$ and $X_3$, and solving.) We are told
likewise that each increase of one unit in $X_2$ will bring an increase
of two units in $X_1$ and that each increase of one unit in $X_3$ will
bring a decrease of three units in $X_1$. In other words, the
parameters of the multiple-regression equation tell us the same
type of thing that we are told by the parameters of the
simple-regression equation.
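
The reading of the parameters just given can be checked by direct substitution. The following is a minimal sketch in Python, offered only as an illustration; the function name `estimate_x1` and its keyword arguments are our own and do not come from the text. It evaluates the sample equation $X_1 = 3 + 2X_2 - 3X_3$ at the values used above and then changes one independent variable at a time.

```python
def estimate_x1(x2, x3, a=3, b2=2, b3=-3):
    """Estimate X1 from the sample equation X1 = 3 + 2*X2 - 3*X3."""
    return a + b2 * x2 + b3 * x3

# The substitution worked in the text: X2 = 10 and X3 = 7.
print(estimate_x1(10, 7))                        # 2

# The constant term is the estimate when both independent variables are 0.
print(estimate_x1(0, 0))                         # 3

# One more unit of X2 (X3 unchanged) adds b2 = 2 units to the estimate.
print(estimate_x1(11, 7) - estimate_x1(10, 7))   # 2

# One more unit of X3 (X2 unchanged) removes 3 units from the estimate.
print(estimate_x1(10, 8) - estimate_x1(10, 7))   # -3
```

Whatever starting values are chosen for the two independent variables, the last two differences remain 2 and -3; that constancy is what makes the equation linear.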

The multiple linear regression equation is always of the general
form

$$ X_1 = a + b_2X_2 + b_3X_3 + b_4X_4 + \cdots $$

The parameters $b_2$, $b_3$, $b_4$, etc., are called the regression coefficients.
Strictly speaking, we should give them more subscripts so that
we can tell not only with which independent variable they are
connected, but also which variable is dependent and what other
variables are independent. For example, if we have two
independent variables and one dependent variable (the latter being
$X_1$), we should write our complete multiple linear regression
equation thus:

$$ X_1 = a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 $$

The parameter $b_{12.3}$ is called the regression coefficient of $X_1$ on $X_2$
with $X_3$ held constant. It tells us the amount by which $X_1$ will
vary for each unit's change in $X_2$ if there are no changes in the
value of $X_3$. Likewise $b_{13.2}$ tells us the number of units by which
$X_1$ will change for each unit of change in $X_3$ if there are no changes
in $X_2$ at the same time. Consequently $b_{13.2}$ is called the
regression coefficient of $X_1$ on $X_3$ with $X_2$ held constant.

These terms sound formidable, but they represent no ideas
which we have not encountered in the case of simple linear
correlation except the idea of "holding constant." Obviously we
cannot say what will happen to $X_1$ when we change $X_2$ unless we
know that there are no variations in $X_3$ at the same time. We
do not know what will happen to the period of the pendulum when
we change its length unless we hold gravity constant. But if we
can hold gravity constant, then we can say that the addition of
1 in. to a pendulum whose present length is 13 in. will have a
definite effect on the period. It may be that we shall have to hold
two or three or more other factors constant. If we had five
independent variables, one of our regression coefficients would be
$b_{12.3456}$. This would be the regression coefficient of $X_1$ on $X_2$ with
$X_3$, $X_4$, $X_5$, and $X_6$ held constant. It would represent the same
type of thing as a statement regarding the effect of changing the
length of a pendulum while we held constant the pull of gravity,
the temperature, the barometric pressure, and the resistance of
the surrounding medium.

If we use all the subscripts, then, our typical regression
equation for multiple linear correlation involving two independent
variables will be

$$ X_1 = a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 $$

Our problem would be that of evaluating the parameters in order
that we might state the equation in a form like this:

$$ X_1 = 6 + 17X_2 - 1.543X_3 $$

With three independent variables, the equation would become

$$ X_1 = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 $$

The extension for a greater number of variables is obvious. As
long as the relationship is linear, each $X$ (that is, $X_2$, $X_3$, etc.)
will appear in the equation but once and always in the first power.
The major problem is that of determining the values of the
regression coefficients and of the $a$ term so that it will be possible to
make estimates.

16.4. Types of Relationship.—In the regression equations of
which we have just been speaking it is evident that every addition
of one unit to any independent variable has the same effect
regardless of the size of the dependent variable. Thus, if our
regression equation is

$$ X_1 = 5 + 2.4X_2 + 1.5X_3 $$

it is clear that whenever we add one unit to $X_2$ ($X_3$ remaining the
same) we increase the value of $X_1$ by 2.4 units. This is true
regardless of the size of $X_2$. Similarly each increase of one unit
in $X_3$ brings an addition of 1.5 units in $X_1$ regardless of the size of
$X_3$. When our multiple-regression equation is of this type, we
say that the relationship is linear. The equation corresponds to
that for simple linear correlation, in which the independent
variable appears but once and in the first power. Just as the simple
linear regression equation can be represented by a straight line,
so the multiple linear regression equation involving two
independent variables can be represented by a plane.

For example, suppose that we are studying the effect of
variations in temperature and in rainfall on the yield of potatoes. If
the relationship is linear, it can be described by some such plane
as that in section A of Fig. 16.1. In this chart is shown a solid.
The height of the solid at any point represents the value of the
dependent variable (potato yield). The scale along the lower
right-hand edge represents the values of $X_2$ (rainfall), and the
scale along the lower left-hand edge represents values of $X_3$
(temperature). It will be noted that the values of $X_1$ are
represented by a plane which slopes from the back corner toward the
front corner. As we move along the $X_3$ scale from left to right
(that is, as the value of $X_3$ increases while the value of $X_2$ is fixed),
the values of $X_1$ (heights of the solid) diminish. Moreover, they
diminish regularly along a straight line, and everywhere along
lines of the same slope, no matter at what point we hold $X_2$
constant.

Fig. 16.1. Typical cases of multiple linear correlation, A; multiple
curvilinear correlation, B; and joint correlation, C.

Now let us move along the $X_2$ scale from the front of the solid
to the back. It will be seen that, as the values of $X_2$ increase, the
value of $X_1$ likewise increases regularly along a straight line.
Regardless of the point at which we hold $X_3$ constant, the slope
of this line is the same; that is, if we pass many vertical planes
through the solid parallel to the $X_2$ scale, they will all cut the
upper plane in parallel straight lines. This is always true of
linear multiple correlation. Just as the parameter $b$ in the simple
linear regression equation $Y = a + bX$ shows the slope of the
regression line, so the parameter $b_{12.3}$ in the multiple linear
regression equation shows the slope of any line on the regression plane
which is parallel to the $X_2$ scale, and the parameter $b_{13.2}$ shows the
slope of any line on the regression plane which is parallel to the
$X_3$ scale. If the correlation is linear, all these lines will be
straight; that is, the regression surface (the surface of the solid
pictured) will be a plane surface.

Multiple curvilinear correlation exists when the relationship
between one or more of the independent variables and the
dependent variable is curvilinear. The situation can be pictured as in
section B of Fig. 16.1. The variables here are the same as before,
but now it will be noted that the surface is no longer a plane. It
will be seen that as one moves along the $X_3$ axis from left to right
the values of $X_1$ diminish, and that they diminish everywhere
along straight lines which are parallel. Thus, the relationship
between $X_3$ and $X_1$ is linear, as in the preceding case. But when
one moves along the $X_2$ axis from front to back, the surface is
curved. At first the values of $X_1$ fall somewhat, but thereafter
they rise more and more rapidly. It will be noted, however, that
no matter where we start to cross the surface, as long as we cross
it in a direction parallel to the $X_2$ axis we shall always pass over
the same type of curve. We shall always start by walking
downhill for a way, after which we shall start up a hill which becomes
increasingly steep. In other words, all the paths across the
surface which are parallel to the $X_2$ axis describe parallel curves.
In multiple correlation this is always true. We can generalize by
stating that in multiple correlation the relationship between any
one independent variable and the dependent variable (all other
variables being held constant) is always the same regardless of
the point at which the other variables are held constant.

In the present case it looks as though the relationship between
$X_2$ and $X_1$ when $X_3$ is constant could be described by a
second-degree parabola. All the curves on the surface are the same, and
each has but one bend. Hence the relationship between $X_2$ and
$X_1$ could presumably be shown by a regression equation of the
general type

$$ X_1 = a_{1.2} + b_{12}X_2 + b_{12'}X_2^2 $$

This is the general formula for the second-degree parabola adapted
to our new symbolism. The relationship between $X_3$ and $X_1$ can,
on the other hand, be depicted by a straight line of the general
formula

$$ X_1 = a_{1.3} + b_{13}X_3 $$

The relationship of $X_2$ and $X_3$ together to $X_1$ could, then, be
shown by taking the sum of these two relationships. However,
since this operation would give us two constant terms ($a_{1.2}$ and
$a_{1.3}$), and since the sum of the two constant terms will likewise be
constant, we can substitute the new constant $a_{1.23}$ for the sum of
the other two. Now if $X_1$ is to be estimated from both together,
it will be estimated by some such equation as

$$ X_1 = a_{1.23} + b_{12.3}X_2 + b_{12'.3}X_2^2 + b_{13.2}X_3 $$

This would be a regression equation for multiple curvilinear
correlation. The equation for the plane surface of multiple linear
correlation would be of the general form

$$ X_1 = a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 $$

It will be seen that this is the same type equation that we
mentioned on page 497.

There still remains one type of correlation to be described. It
is quite possible that rainfall and temperature will both be related
to potato yield, but that the relationship between rainfall and
potato yield will vary according to the temperature. It is
possible that greater rainfalls might be advantageous if the
temperature were high but disadvantageous if the temperature were low.
Thus the line or curve showing the relationship between rainfall
and yield, with temperature constant, would differ according to
the point at which we held the temperature constant. In such a
case we should say that there was joint correlation between the
variables. Such a situation is pictured in section C of Fig. 16.1.
It will be noted that, as one passes along the right-hand front
edge of the surface parallel to the $X_2$ axis, the values of $X_1$
increase. When one passes along the left-hand back edge of the
surface parallel to the $X_2$ axis, the values of $X_1$ decrease. If we
let $X_1$ represent potato yield, $X_2$ the rainfall, and $X_3$ the
temperature as before, this is equivalent to saying that with high
temperatures increasing rainfall increases the yield, but with low
temperatures increasing rainfall lowers the yield. In such a case
the surface is warped, and we say that the correlation is joint.
Joint correlation exists, then, when the nature of the relationship
between an independent variable and the dependent variable
differs according to the size of some other independent variable,
just as here the relationship between potato yield and rainfall
differs at different temperatures. There is no simple type of
regression equation which we can use to describe the joint
correlation surface, and the methods of handling problems of joint
correlation must be left to more advanced treatises.¹
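
The distinction among the three types can be checked numerically. The sketch below is illustrative only: the three surfaces and all their coefficients are invented, and the product term used for the joint case is merely one convenient way of producing a warped surface, not a form given in the text. For each surface the code records how much $X_1$ changes when $X_2$ rises by one unit, first with $X_3$ held at a low value and then at a high one.

```python
def linear(x2, x3):
    # A plane: every unit of X2 adds 2, every unit of X3 removes 3.
    return 3 + 2 * x2 - 3 * x3

def curvilinear(x2, x3):
    # A parabola in X2 (falls, then rises) plus a straight line in X3.
    return 10 - 4 * x2 + x2 ** 2 - 3 * x3

def joint(x2, x3):
    # Hypothetical warped surface: the effect of X2 depends on the level of X3.
    return 10 + (x3 - 5) * x2

def effect_of_x2(surface, x3, x2_values=(1, 2, 3)):
    """Change in X1 for a one-unit rise in X2, with X3 held constant at x3."""
    return [surface(x2 + 1, x3) - surface(x2, x3) for x2 in x2_values]

for surface in (linear, curvilinear, joint):
    low = effect_of_x2(surface, x3=2)
    high = effect_of_x2(surface, x3=8)
    verdict = "same path" if low == high else "path depends on X3"
    print(surface.__name__, low, high, verdict)
# linear [2, 2, 2] [2, 2, 2] same path
# curvilinear [-1, 1, 3] [-1, 1, 3] same path
# joint [-3, -3, -3] [3, 3, 3] path depends on X3
```

In the first two cases the path across the surface parallel to the $X_2$ axis is the same wherever $X_3$ is held, which is the mark of multiple correlation. In the third the path reverses direction between the low and the high value of $X_3$, which is the warped surface of joint correlation.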

We can classify relationships, then, as follows:

1. Simple correlation (but two variables involved).

   a. Linear (regression line straight).

   b. Curvilinear (regression line curved).

2. Multiple correlation. (Two or more independent variables.
Relationship of each independent to the dependent constant, regardless of
size of other independents.)

   a. Linear (described by a plane).

   b. Curvilinear (described by a regular curved surface).

3. Joint correlation. (Two or more independent variables.
Relationship of one or more of the independent variables with the dependent
variable varies according to the value of other independent variables.
Described by warped surface.)

16.5. Methods of Computation.—It will be impossible for us to
give a detailed description of multiple-correlation analysis here,
but we can sketch briefly the processes involved.

The regression equation for linear multiple correlation is
discovered by solving normal equations which are similar to, and
derived by the same process as, the normal equations of simple
correlation. If we have two independent variables, $X_2$ and $X_3$,
and if the dependent variable is, as is customary, represented by
$X_1$, the normal equations for linear relationships are

$$ Na + b_2\Sigma X_2 + b_3\Sigma X_3 = \Sigma X_1 $$
$$ a\Sigma X_2 + b_2\Sigma(X_2^2) + b_3\Sigma(X_2X_3) = \Sigma(X_1X_2) $$
$$ a\Sigma X_3 + b_2\Sigma(X_2X_3) + b_3\Sigma(X_3^2) = \Sigma(X_1X_3) $$
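
For the reader who wishes to see where these three equations come from, the following is a brief sketch of the usual least-squares argument, written with the shorter symbols $a$, $b_2$, and $b_3$. The symbol $Q$ for the sum of squared errors is our own and is not used elsewhere in the text.

```latex
% Sum of the squared errors of estimate, to be made as small as possible:
Q = \Sigma (X_1 - a - b_2 X_2 - b_3 X_3)^2

% Setting each partial derivative equal to zero:
\frac{\partial Q}{\partial a}   = -2\,\Sigma (X_1 - a - b_2 X_2 - b_3 X_3) = 0
\frac{\partial Q}{\partial b_2} = -2\,\Sigma X_2 (X_1 - a - b_2 X_2 - b_3 X_3) = 0
\frac{\partial Q}{\partial b_3} = -2\,\Sigma X_3 (X_1 - a - b_2 X_2 - b_3 X_3) = 0

% Dividing by -2, distributing the sums, and transposing (note that \Sigma a = Na):
Na + b_2 \Sigma X_2 + b_3 \Sigma X_3 = \Sigma X_1
a \Sigma X_2 + b_2 \Sigma (X_2^2) + b_3 \Sigma (X_2 X_3) = \Sigma (X_1 X_2)
a \Sigma X_3 + b_2 \Sigma (X_2 X_3) + b_3 \Sigma (X_3^2) = \Sigma (X_1 X_3)
```

Each added independent variable contributes one more parameter, hence one more partial derivative and one more normal equation.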

¹ See especially M. J. B. Ezekiel, "Methods of Correlation Analysis,"
Chaps. 21 and 22, John Wiley & Sons, Inc., New York, 1941.

One must find the values of $N$, $\Sigma X_1$, $\Sigma X_2$, $\Sigma X_3$, $\Sigma(X_2^2)$, $\Sigma(X_3^2)$,
$\Sigma(X_1X_2)$, $\Sigma(X_1X_3)$, and $\Sigma(X_2X_3)$. These values are substituted
in the normal equations, and the equations are then solved for
values of $a$, $b_2$, and $b_3$. The two latter parameters are really the
coefficients of regression $b_{12.3}$ and $b_{13.2}$, but here we are using the
shorter form. We can illustrate the method of applying the
equations by solving the hypothetical case in Table 16.1. We
shall use but five sets of observations, although no one would in
practice apply such complicated methods to so few cases. Here
the figures are given merely for illustrative purposes. One can
assume, if one wishes, that $X_1$ is potato yield, $X_2$ is rainfall, and
$X_3$ is temperature, as in the illustrations on page 498.

Table 16.1.—Computation of Multiple Linear Regression Equation

         X_1   X_2   X_3   X_2^2   X_3^2   X_1X_2   X_1X_3   X_2X_3
           1     3     2       9       4        3        2        6
           2     4     4      16      16        8        8       16
           3     2     9       4      81        6       27       18
           4     1    13       1     169        4       52       13
           5     5    12      25     144       25       60       60
Total     15    15    40      55     414       46      149      113

Substituting these totals in the normal equations (and remembering that
$N = 5$), we get

$$ 5a + 15b_2 + 40b_3 = 15 $$
$$ 15a + 55b_2 + 113b_3 = 46 $$
$$ 40a + 113b_2 + 414b_3 = 149 $$

Solving these three equations simultaneously, we have the
following values of the parameters:

$$ a = -0.66666\cdots $$
$$ b_2 = +0.33333\cdots $$
$$ b_3 = +0.33333\cdots $$

Substituting these values in the type equation, we get

$$ X_1 = -0.6667 + 0.3333X_2 + 0.3333X_3 $$

In this case the equation could more easily be put in fractional
form, thus:

$$ X_1 = -\frac{2}{3} + \frac{X_2}{3} + \frac{X_3}{3} $$


If we have three independent variables, the normal equations
become

$$ Na + b_2\Sigma X_2 + b_3\Sigma X_3 + b_4\Sigma X_4 = \Sigma X_1 $$
$$ a\Sigma X_2 + b_2\Sigma(X_2^2) + b_3\Sigma(X_2X_3) + b_4\Sigma(X_2X_4) = \Sigma(X_1X_2) $$
$$ a\Sigma X_3 + b_2\Sigma(X_2X_3) + b_3\Sigma(X_3^2) + b_4\Sigma(X_3X_4) = \Sigma(X_1X_3) $$
$$ a\Sigma X_4 + b_2\Sigma(X_2X_4) + b_3\Sigma(X_3X_4) + b_4\Sigma(X_4^2) = \Sigma(X_1X_4) $$

The extension of these equations for a larger number of
independent variables is obvious.
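
For readers with a computer at hand, the arithmetic of the normal equations can be sketched as follows. This is an illustration under stated assumptions, not a recommended routine: the helper names `normal_equations` and `solve` are hypothetical, exact fractions are used so that the thirds appear without rounding, and the data are the five observations of Table 16.1 as we read them (with $X_2$ recovered from the column of squares). The same function accepts any number of independent variables, since each added variable merely adds one column of sums.

```python
from fractions import Fraction

def normal_equations(y, columns):
    """Sums for the normal equations of y = a + b2*col2 + b3*col3 + ...

    Returns (A, rhs): the matrix of sums of cross products and the
    right-hand sides, in the order a, b2, b3, ...
    """
    # A column of ones lets the constant a be treated like any other b.
    cols = [[1] * len(y)] + [list(c) for c in columns]
    A = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return A, rhs

def solve(A, rhs):
    """Solve the simultaneous equations by Gauss-Jordan elimination."""
    n = len(rhs)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for i in range(n):
        # Bring up a row whose entry in column i is not zero.
        pivot = next(r for r in range(i, n) if M[r][i] != 0)
        M[i], M[pivot] = M[pivot], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                factor = M[r][i]
                M[r] = [vr - factor * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

x1 = [1, 2, 3, 4, 5]       # dependent variable
x2 = [3, 4, 2, 1, 5]
x3 = [2, 4, 9, 13, 12]

A, rhs = normal_equations(x1, [x2, x3])
print(A)            # [[5, 15, 40], [15, 55, 113], [40, 113, 414]]
print(rhs)          # [15, 46, 149]
a, b2, b3 = solve(A, rhs)
print(a, b2, b3)    # -2/3 1/3 1/3
```

The matrix printed first is simply the left-hand side of the three numerical normal equations, and the solution agrees with the fractional form of the regression equation.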

The method of handling curvilinear multiple correlation
depends on the type of curve to be fitted. One would have to
determine this, as in other cases, by preliminary classification
of the data. It would be necessary, however, to choose cases
in which the values of $X_3$ (for example) were about equal, and
to determine by group averages or scatter diagrams of these
cases the type of curve that described the relationship of $X_2$ and
$X_1$. If a parabolic curve could be used, the method would be a
simple extension of that just given. For example, if we have a
relationship similar to that pictured in section B of the chart on
page 499, we know that the relation between $X_1$ and $X_2$ (when
$X_3$ is constant) can be represented by a second-degree parabola,
while the relationship between $X_1$ and $X_3$ (when $X_2$ is constant)
can be represented by a straight line. As was pointed out on
page 501, under such circumstances the type equation will be of
the general form

$$ X_1 = a + b_2X_2 + b_{2'}X_2^2 + b_3X_3 $$

The normal equations for such a surface are

$$ Na + b_2\Sigma X_2 + b_{2'}\Sigma(X_2^2) + b_3\Sigma X_3 = \Sigma X_1 $$
$$ a\Sigma X_2 + b_2\Sigma(X_2^2) + b_{2'}\Sigma(X_2^3) + b_3\Sigma(X_2X_3) = \Sigma(X_1X_2) $$
$$ a\Sigma(X_2^2) + b_2\Sigma(X_2^3) + b_{2'}\Sigma(X_2^4) + b_3\Sigma(X_2^2X_3) = \Sigma(X_1X_2^2) $$
$$ a\Sigma X_3 + b_2\Sigma(X_2X_3) + b_{2'}\Sigma(X_2^2X_3) + b_3\Sigma(X_3^2) = \Sigma(X_1X_3) $$

Again we shall illustrate with a short hypothetical example.
Letting $X_1$ represent potato yield, $X_2$ the rainfall, and $X_3$ the
temperature, suppose that we obtain the data of Table 16.2.

Table 16.2.—Computation of Multiple Curvilinear Regression Equation

        X_1  X_2  X_3  X_2^2  X_3^2  X_2^3  X_2^4  X_2X_3  X_1X_2  X_2^2X_3  X_1X_2^2  X_1X_3
          9    1    2      1      4      1      1       2       9         2         9      18
         12    2    3      4      9      8     16       6      24        12        48      36
         13    3    6      9     36     27     81      18      39        54       117      78
         32    4    1     16      1     64    256       4     128        16       512      32
         29    5    8     25     64    125    625      40     145       200       725     232
Total    95   15   20     55    114    225    979      70     345       284      1411     396

Substituting these totals in the normal equations, we have

$$ 5a + 15b_2 + 55b_{2'} + 20b_3 = 95 $$
$$ 15a + 55b_2 + 225b_{2'} + 70b_3 = 345 $$
$$ 55a + 225b_2 + 979b_{2'} + 284b_3 = 1411 $$
$$ 20a + 70b_2 + 284b_{2'} + 114b_3 = 396 $$

If we solve these equations we find the following values of the
parameters:

$$ a = 10 \qquad b_{2'} = 1 $$
$$ b_2 = 2 \qquad b_3 = -2 $$

The regression equation, then, becomes

$$ X_1 = 10 + 2X_2 + X_2^2 - 2X_3 $$

With more complicated curvilinear relationships it becomes
necessary to derive the appropriate normal equations each time for the
particular problem in hand. With the aid of the calculus and
the examples here given, this should not be a difficult task.
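
The parabolic case needs no new machinery: the squared values of $X_2$ are simply treated as one more independent variable. The sketch below illustrates this with the five observations of Table 16.2 as we read them. The function name `fit` is hypothetical, and the elimination routine is an ordinary Gauss-Jordan reduction in exact fractions, chosen for transparency rather than efficiency.

```python
from fractions import Fraction

def fit(y, columns):
    """Constants of y = a + sum(b_k * column_k), found from the normal equations."""
    cols = [[1] * len(y)] + [list(c) for c in columns]
    n = len(cols)
    # Augmented matrix: sums of cross products, then the right-hand side.
    M = [[Fraction(sum(p * q for p, q in zip(ci, cj))) for cj in cols]
         + [Fraction(sum(p * q for p, q in zip(ci, y)))] for ci in cols]
    for i in range(n):
        pivot = next(r for r in range(i, n) if M[r][i] != 0)
        M[i], M[pivot] = M[pivot], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

x1 = [9, 12, 13, 32, 29]    # dependent variable
x2 = [1, 2, 3, 4, 5]
x3 = [2, 3, 6, 1, 8]
x2_squared = [v ** 2 for v in x2]

# Order of the constants matches the type equation: a, b2, b2', b3.
a, b2, b2_prime, b3 = fit(x1, [x2, x2_squared, x3])
print(a, b2, b2_prime, b3)  # 10 2 1 -2
```

A cubic term, or a curve in $X_3$ as well, would be handled in the same way by passing further columns, though with only five observations such additions would soon leave nothing to be explained.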

The constants of the regression equation are interpreted as in
simple correlation. In the numerical illustration of multiple
linear correlation on page 504, we found the equation

$$ X_1 = -\frac{2}{3} + \frac{X_2}{3} + \frac{X_3}{3} $$

In terms of decimals this gives approximately

$$ X_1 = -0.67 + 0.33X_2 + 0.33X_3 $$

The first constant, $a$, is equal to $-0.67$. This tells us that $X_1$
will have a value of $-0.67$ when the two independent variables
are both equal to 0. In the terms of our problem this would mean
that with a rainfall of 0 and a temperature of 0 the potato crop
would be $-0.67$ bushels. This makes no sense. In such cases
we think of this constant merely as one which defines the height
of the regression plane, and we do not try to give it further
significance. The next constant ($b_{12.3}$) is 0.33. This tells us
that every increase of one unit in the size of $X_2$ brings an increase
of one-third of a unit in the size of $X_1$ if there is no change in the
size of $X_3$. That is, if temperature is constant, each additional
inch of rainfall will add one-third of a hundred bushels to the
potato yield (assuming now that 100 bushels is the unit in which
potato yield is measured). Similarly the third constant ($b_{13.2}$)
is 0.33. This means that each unit of increase in the size of $X_3$
($X_2$ remaining fixed) is accompanied by an increase of one-third
of a unit in the value of $X_1$. In the terms of our problem, if we
consider only changes in temperature, with no variations in
rainfall, each additional degree of temperature is accompanied
by an increase of one-third of a hundred bushels (that is,
one-third of a unit) in the size of the potato yield.

16.6. Effects of Variables Separately.—Here rainfall and
temperature seem equal in their effect on yield, since a change of one
unit in either is accompanied by a change of a third of a unit in
yield. Their effects seem to be exactly the same. But it may
be true that a difference of 1° in temperature from season to
season is a very minor change, while a difference of 1 in. in rainfall
is large. In other words, the fact that a change of one unit in the
one brings the same result as a change of one unit in the other
does not tell us enough. It is hard to compare a change of 1° in
temperature with a change of 1 in. in precipitation. We have
discovered, however, that measurements which were originally
taken in different units can be compared if stated each in terms
of its own standard deviation (see page 170). It is not
uncommon so to state the regression coefficients; in this way we make
them comparable. When regression coefficients are stated in
terms of their standard deviations, they are called beta coefficients,
and they are represented by the Greek letter beta followed by
whatever subscripts followed the regression coefficient. For
example, $\beta_{12.3} = b_{12.3}(\sigma_2/\sigma_1)$. Similarly $\beta_{13.2} = b_{13.2}(\sigma_3/\sigma_1)$. In
our recent short numerical example we found the following values
of the regression coefficients:

$$ b_{12.3} = 0.33 $$
$$ b_{13.2} = 0.33 $$

Inspection of the original data in the table on page 503 shows
that the standard deviations of the variables are

$$ \sigma_1 = 1.4 $$
$$ \sigma_2 = 1.4 $$
$$ \sigma_3 = 4.3 $$

In this hypothetical case, then, a variation of one unit in $X_3$ is
much more likely to occur than is a variation of one unit in $X_2$.
If we put the regression coefficients in terms of the standard
deviations, we get the following beta coefficients:

$$ \beta_{12.3} = 0.33\left(\frac{1.4}{1.4}\right) = 0.33 $$
$$ \beta_{13.2} = 0.33\left(\frac{4.3}{1.4}\right) = 1.01 $$

The first of these figures tells us that each increase of one standard
deviation in the value of $X_2$ will be accompanied (if $X_3$ stays
constant) by an increase of 0.33 standard deviations in the
value of $X_1$. The second of the beta coefficients tells us that an
increase of one standard deviation in the value of $X_3$ ($X_2$ staying
constant) will be accompanied by an increase of 1.01 standard
deviations in the value of $X_1$. Since a standard deviation of
change is equally likely to occur in $X_2$ and in $X_3$ (if the
distributions are normal), it is evident that $X_3$ probably accounts
for much more of the actual change in $X_1$ in our problem than
does $X_2$. In other words, variations in temperature have more
effect than variations in rainfall in bringing about changes of
potato yield. The multiple-regression equation may be written
in terms of the beta coefficients rather than in terms of the
coefficients of regression,¹ and the beta coefficients also become
very useful in more advanced statistical work. Here we note
merely the fact that they are useful in determining the relative
importance of the various independent variables. The relative
importance of the several independent variables is also shown
by coefficients of partial correlation and by coefficients of part
correlation. It is impossible for us to describe these coefficients
here; the interested student is referred to more advanced works.¹

¹ For example, the regression equation for multiple linear correlation
with two independent variables is usually written

$$ X_1 = a + b_2X_2 + b_3X_3 $$

It may be written in terms of the beta coefficients thus (where $A$ is a
constant):

$$ \frac{X_1}{\sigma_1} = A + \beta_{12.3}\frac{X_2}{\sigma_2} + \beta_{13.2}\frac{X_3}{\sigma_3} $$
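
The conversion of regression coefficients into beta coefficients can be sketched in a few lines. The code below is an illustration only: the function name `std_dev` is our own, the data are the five observations of the table on page 503 as we read them, and the standard deviations are computed with $N$ (not $N - 1$) in the divisor, which is the form that reproduces the rounded figures 1.4, 1.4, and 4.3 quoted in the text.

```python
from math import sqrt

def std_dev(values):
    """Standard deviation about the mean, with N in the divisor."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

x1 = [1, 2, 3, 4, 5]        # dependent variable
x2 = [3, 4, 2, 1, 5]
x3 = [2, 4, 9, 13, 12]
b12_3 = b13_2 = 1 / 3       # regression coefficients found earlier

s1, s2, s3 = std_dev(x1), std_dev(x2), std_dev(x3)
print(round(s1, 1), round(s2, 1), round(s3, 1))   # 1.4 1.4 4.3

# Each beta is the regression coefficient restated in standard-deviation units.
beta12_3 = b12_3 * s2 / s1
beta13_2 = b13_2 * s3 / s1
print(round(beta12_3, 2), round(beta13_2, 2))
```

Because the unrounded standard deviations are used, the printed betas may differ in the last place from figures worked with the rounded values. The comparison is unaffected: the larger beta attaches to $X_3$.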

16.7. Other Correlation Constants.—Having covered briefly
the multiple regression equation, we now turn to the correlation
coefficients or indices and the standard errors of estimate. These
are determined by methods already familiar to us (see pages 458 and
481ff.). Having found our regression equation, we estimate each
value of $X_1$ from the known values of $X_2$, $X_3$, etc. We compare
the estimated and the actual values of $X_1$ and compute the
differences between them. The standard deviation of these
differences is the standard error of estimate. If we have
computed the linear multiple-regression equation, we usually denote
the standard error of estimate thus:

$$ S_{1.234} $$

This would mean, "The standard error of estimating $X_1$ from
$X_2$, $X_3$, and $X_4$." The first subscript indicates the dependent
variable, and the subscripts following the decimal point indicate
the independent variables. If the regression surface is curved,
the standard error of estimate is denoted by the symbol

$$ S_{1.f(234)} $$

The letter $f$ signifies that we have used some function of variables
2, 3, and 4, but does not state what function. For further
information one would have to consult the regression equation itself,
which would usually be given.

After we have found the standard error of estimate it is easy
to find the coefficient of multiple correlation (if the relationship
is linear) or the index of multiple correlation (if the relationship
is curvilinear). The same formula is used for either:

$$ R_{1.234} = \sqrt{1 - \frac{S^2_{1.234}}{\sigma_1^2}} $$

$$ P_{1.234} = \sqrt{1 - \frac{S^2_{1.f(234)}}{\sigma_1^2}} $$

¹ Especially good is Ezekiel, op. cit.

Thus the coefficient of multiple correlation (or the index of
multiple correlation) is based on a comparison of the variability
around the regression line (the errors made in estimating by
means of the regression equation) and the variability about the
mean of the dependent variable. These coefficients tell us, as
have the others which we have studied, the extent to which our
errors of estimate are reduced if the estimate is based on the
regression equation rather than on chance.
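
The two steps, from differences to standard error of estimate and from standard error to coefficient, can be sketched as follows. The figures are invented for illustration and belong to no problem in the text; the function names are likewise our own. The standard error is computed as the root-mean-square difference, which equals the standard deviation of the differences whenever they average zero, as they do for a least-squares fit and for the figures used here.

```python
from math import sqrt

def standard_error_of_estimate(actual, estimated):
    """Root-mean-square difference between actual and estimated values."""
    diffs = [y - e for y, e in zip(actual, estimated)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

def multiple_correlation(actual, estimated):
    """R (or the index P) from the relation R^2 = 1 - S^2 / sigma_1^2."""
    mean = sum(actual) / len(actual)
    variance = sum((y - mean) ** 2 for y in actual) / len(actual)
    s = standard_error_of_estimate(actual, estimated)
    return sqrt(1 - s ** 2 / variance)

# Hypothetical actual values of X1 and the estimates from some regression equation.
actual = [10, 12, 15, 19, 24]
estimated = [11, 11, 16, 18, 24]

print(round(standard_error_of_estimate(actual, estimated), 3))   # 0.894
print(round(multiple_correlation(actual, estimated), 3))         # 0.984
```

The same two functions serve whether the estimates came from a plane or from a curved surface; only the name given to the result (coefficient or index) differs.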

Enough has been said here as to the methods of computing the
constants of multiple-correlation problems to give the student a
fairly good idea of the concepts involved. It is not hoped to do
more. Methods have been developed by which a considerable
part of the mathematical work of multiple-correlation problems
can be saved, and by which the accuracy of the work can be
checked step by step. These methods are of tremendous
importance to anyone who goes at the problems of multiple correlation
seriously. Other sources must be consulted for a description of
these methods.¹ The purpose of the present chapter is merely
that of acquainting the student with the concepts of multiple
and joint correlation in the hope that he will understand simple
correlation and the concepts of relationship, in general, better
for having taken this brief journey into more complicated fields.

16.8. Corrections for Degrees of Freedom.—In
multiple-correlation problems the corrections for the number of cases and
the number of parameters are as important as before. In fact,
the correlations are likely to be larger in these cases because the
addition of variables makes for the addition of parameters.
The formulas by which the unadjusted coefficients are corrected
are the same as those given heretofore on page 484.

16.9. Standard Errors of Coefficients of Multiple Correlation.— 
The standard errors of coefficients of multiple correlation or 
indices of multiple correlation are computed by formulas similar 
to those used before, and their interpretation is unchanged. 
For example: 

$$ \sigma_{R_{1.234}} = \frac{1 - R^2_{1.234}}{\sqrt{N - m}} $$

$$ \sigma_{P_{1.234}} = \frac{1 - P^2_{1.234}}{\sqrt{N - m}} $$

¹ Especially helpful is H. A. Wallace and G. W. Snedecor, Correlation
and Machine Calculation, Iowa State College Bulletin 35, 1931.

The probable errors are, as before, found by multiplying the
standard errors by 0.6745.

Here, as before, we are troubled by the fact that standard errors
computed by these formulas are misleading unless the numbers
of cases are very large. As the number of parameters is increased,
it becomes more and more important that the number of cases
studied be sizable. The large amount of arithmetical work
involved makes it certain that statisticians will seldom work
with more than five or six independent variables, although cases
can be found where many more have actually been used.
Possibly we can make a rough statement about the reliability of
multiple-correlation results, without being too far from the
facts, if we say that in problems where the number of cases runs
from 50 to 100 or more (preferably more) and where the number
of independent variables is not over four or five, coefficients of
multiple correlation greater than $R = 0.5$ are probably
significant. However, such crude rules of thumb are seldom
satisfactory, and should be used merely for a first rough check. For more
accurate tests we fall back on the method of analysis of variance,
which is based on a comparison of the dispersion in the original
data and the dispersion around the regression surface.
Unfortunately the methods involved are too complicated for us to take up
here.¹

16.10. Suggestions for Further Reading.—The problems of multiple 
correlation have been covered here in but the briefest summary form, in an 
attempt to give the elementary student some idea as to the possibilities of the 
method without making him expert in its application. Any attempt to 
apply these methods to actual problems should be preceded by further 
reading and study. The most helpful books are listed in Sec. 15.16. 

EXERCISES 

1. Give examples from the fields of physics, geometry, and other fields 
of simple and of multiple relationships. For example, if C represents the 
circumference of a circle and if D represents its diameter, we are told that 

$$ C = 3.1416D $$

This is an example of simple linear correlation. Find others, both simple 
and multiple. 

¹ For complete treatment see R. A. Fisher, "Statistical Methods for
Research Workers," Oliver & Boyd, Edinburgh and London, 1938; or
Wallace and Snedecor, op. cit.





2. In the example given in the preceding exercise, the circumference is
treated as the dependent variable. How can you tell that this is true? Is 
the circumference any more dependent on the diameter than the diameter 
is on the circumference? If you were to treat the diameter as dependent, 
how would you change the statement of the problem? 

3. Suppose we are studying a problem in which there are three variables,
as follows:

$X_1$ = yield of milk in pounds

$X_2$ = pounds of grain fed per cow per day

$X_3$ = age of cow in years

Suppose that we find the following regression equation:

$$ X_1 = 3 + 2.5X_2 - 0.1X_3 $$

a. Exactly what is the meaning of each of the three numbers in the
regression equation?

b. Suppose that we have the following values:

$$ \sigma_1 = 4 \qquad \sigma_2 = 2 \qquad \sigma_3 = 1 $$

What are the beta coefficients? Interpret them. Does age or grain ration
play the larger part in the fluctuations of milk yield?

4. In a problem of linear multiple correlation we find the value of

$$ b_{12.3} = 3.65 $$

a. What is the meaning of each of the subscripts?

b. What is meant by "holding a factor constant"?

c. Interpret the figure 3.65 above.

5 . Describe in a paragraph each of the sections of the chart on page 499. 

6 . The formula showing the space (s) passed over by a falling body in a 
given time (/) under various gravitational attractions ( g ) is 

s = 

Is the relationship linear, curvilinear, or joint? Test it to see which it is 
by holding g constant at two widely different values and solving for various 
values of t. Plot the results. Do they give the same curve? 
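
The test described in this exercise can be sketched in a few lines, assuming the familiar law of falling bodies, s = (1/2)gt². The two values of g and the list of times below are illustrative choices, not figures from the text. If the gap between the two curves stayed the same at every value of t, the curves would be parallel and the relationship would be merely multiple; a changing gap points to a joint relationship.

```python
# Distance fallen, s = (1/2) * g * t**2, tabulated with g held at two values.

def distance(g: float, t: float) -> float:
    """Space passed over by a falling body after time t under attraction g."""
    return 0.5 * g * t ** 2

times = [0, 1, 2, 3, 4]          # illustrative values of t
g_low, g_high = 5.0, 32.0        # two widely different, illustrative values of g

low = [distance(g_low, t) for t in times]
high = [distance(g_high, t) for t in times]

# Vertical gap between the two curves at each value of t.
gaps = [h - l for h, l in zip(high, low)]

for t, l, h, gap in zip(times, low, high, gaps):
    print(f"t = {t}   s(g={g_low}) = {l:6.1f}   s(g={g_high}) = {h:6.1f}   gap = {gap:6.1f}")

# Parallel curves would show the same gap at every t.
gap_is_constant = len(set(gaps)) == 1
print("gap constant across t:", gap_is_constant)
```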

7. On page 498 is the type equation for multiple linear correlation with 
three independent variables. Write the type equation for four independent 
variables. 

8. Using the normal equations given in the text as samples (see pages 
502 and 504), write the normal equations for multiple correlation with 
four independent variables. 

9. In the table on page 503 are the figures for a multiple-correlation 
problem. The problem was solved with $X_1$ as the dependent variable. 
Suppose that we wish to use $X_2$ as the dependent variable (that is, to 
substitute the column heading $X_1$ for the heading $X_2$). Work out the multiple 
linear regression equation for these same figures but with the new dependent 
variable. Most of the needed totals are already computed in the table. 

10. On page 503 are given the data of a multiple-correlation problem. 
On page 504 is given the regression equation for these data. Estimate the 
values of $X_1$ from the regression equation, find the differences between the 
actual and the estimated values of $X_1$, and compute $S_{1.23}$ and $R_{1.23}$. 

11. In the light of the correction formula on page 484 and the regression 
equation on page 504, explain why one would not use such complicated 
methods with but five cases, as in the illustration. 

12. The figures of Table 16.3 show the average weight of men of various 
heights and ages.¹ This is obviously a case of either multiple or joint 
correlation, since there are two independent variables (age and height) 
and one dependent variable (weight). Cross-classifications such as this 
would be used by a statistician to determine the nature of the relationship. 

Table 16.3. Average Weight of Men of Various Heights and Ages 

                            Height (feet and inches) 
Age group    5'0"    5'2"    5'4"    5'6"    5'8"    5'10"   6'0"    6'2" 

15-19        113     118     124     132     140     148     158     168 
20-24        119     124     131     139     146     154     163     173 
25-29        124     128     134     142     150     158     169     181 
30-34        127     131     137     145     154     163     174     186 
35-39                        140     148     157     167     178     191 
40-44        132     136     142     150             169     181     194 
45-49        134     138     144     152     161     171     183     197 
50-54        135     139     145     153     162     172     184     198 

a. Is the relationship linear or curvilinear? Plot the figures of one or 
two of the columns on cross-section paper. Do they yield straight lines or 
curves? (Minor random variations from straight lines would not give 
evidence of curvilinearity. Look for regular and consistent deviations 
from a straight line.) Plot the data of two or three of the rows of figures. 
Do they yield straight lines or curves? Are the data linear in both 
directions, in one direction, or in no direction? 

b. Is the relationship multiple or joint? Test by plotting the data from 
various rows on the same piece of cross-section paper, seeing whether they 
yield approximately parallel lines. Plot also the data of the various columns 
on another piece of cross-section paper and see if they yield approximately 
parallel lines. If in both cases the lines are approximately parallel, the 
relationship is multiple. If in either case (or both cases) there is a tendency 
for the shape of the lines to change as we go from row to row or column to 
column, the relationship is joint. 

c. If in either case there are consistent and parallel curves, what is their 
general nature? Are they parabolic or logarithmic, or do they resemble 
any of the type curves we have plotted elsewhere? 
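
Before plotting, the tests of parts a and b can be roughed out numerically. The sketch below is only an aid to the graphic work, not a substitute for it. It uses the two age groups of Table 16.3 whose rows are complete (15-19 and 45-49), and it treats the first height column as 5 feet 0 inches, which is an assumption about the heading. Roughly equal steps along a row would suggest linearity; a roughly constant gap between rows would suggest parallel lines and hence a multiple rather than a joint relationship.

```python
# Rough numerical versions of the linearity and parallelism tests of Exercise 12.
# Heights are in inches; the first column is assumed to be 5'0".
heights_in = [60, 62, 64, 66, 68, 70, 72, 74]

# Average weights (pounds) for two age groups of Table 16.3.
row_15_19 = [113, 118, 124, 132, 140, 148, 158, 168]
row_45_49 = [134, 138, 144, 152, 161, 171, 183, 197]

# Part a: first differences along one row. Equal steps indicate a straight line.
steps_15_19 = [b - a for a, b in zip(row_15_19, row_15_19[1:])]

# Part b: gap between the two rows at each height. A constant gap indicates
# parallel lines.
gaps = [older - younger for younger, older in zip(row_15_19, row_45_49)]

print("steps along the 15-19 row:", steps_15_19)
print("gap between the 45-49 and 15-19 rows:", gaps)
print("range of the gaps:", max(gaps) - min(gaps))
```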

1 Based on figures in “World Almanac,” p. 809, 1934. 

APPENDIX I 

Areas under the Normal Curve 

Fractional parts of the total area (1.000) under the normal curve between the mean and a 
perpendicular erected at various numbers of standard deviations (x/σ) from the mean.¹ 
To illustrate the use of the table, 39.065 per cent of the total area under the curve will lie 
between the mean and a perpendicular erected at a distance of 1.23σ from the mean. 

Each figure in the body of the table is preceded by a decimal point. 


x/σ    .00     .01     .02     .03     .04     .05     .06     .07     .08     .09 

0.0   00000   00399   00798   01197   01595   01994   02392   02790   03188   03586 
0.1   03983   04380   04776   05172   05567   05962   06356   06749   07142   07535 
0.2   07926   08317   08706   09096   09483   09871   10257   10642   11026   11409 
0.3   11791   12172   12552   12930   13307   13683   14058   14431   14803   15173 
0.4   15554   15910   16276   16640   17003   17364   17724   18082   18439   18793 
0.5   19146   19497   19847   20194   20540   20884   21226   21566   21904   22240 
0.6   22575   22907   23237   23565   23891   24215   24537   24857   25175   25490 
0.7   25804   26115   26424   26730   27035   27337   27637   27935   28230   28524 
0.8   28814   29103   29389   29673   29955   30234   30511   30785   31057   31327 
0.9   31594   31859   32121   32381   32639   32894   33147   33398   33646   33891 
1.0   34134   34375   34614   34850   35083   35313   35543   35769   35993   36214 
1.1   36433   36650   36864   37076   37286   37493   37698   37900   38100   38298 
1.2   38493   38688   38877   39065   39251   39435   39617   39796   39973   40147 
1.3   40320   40490   40658   40824   40988   41149   41308   41466   41621   41774 
1.4   41924   42073   42220   42364   42507   42647   42786   42922   43056   43189 
1.5   43319   43448   43574   43699   43822   43943   44062   44179   44295   44408 
1.6   44520   44630   44738   44845   44950   45053   45154   45254   45352   45449 
1.7   45543   45637   45728   45818   45907   45994   46080   46164   46246   46327 
1.8   46407   46485   46562   46638   46712   46784   46856   46926   46996   47062 
1.9   47128   47193   47257   47320   47381   47441   47500   47558   47615   47670 
2.0   47726   47778   47831   47882   47932   47982   48030   48077   48124   48169 
2.1   48214   48257   48300   48341   48382   48422   48461   48500   48537   48574 
2.2   48610   48645   48679   48713   48745   48778   48809   48840   48870   48899 
2.3   48928   48956   48983   49010   49036   49061   49086   49111   49134   49158 
2.4   49180   49202   49224   49245   49266   49286   49305   49324   49343   49361 
2.5   49379   49396   49413   49430   49446   49461   49477   49492   49506   49520 
2.6   49534   49547   49560   49573   49585   49598   49609   49621   49632   49643 
2.7   49653   49664   49674   49683   49693   49702   49711   49720   49728   49736 
2.8   49744   49752   49760   49767   49774   49781   49788   49795   49801   49807 
2.9   49813   49819   49825   49831   49836   49841   49846   49851   49856   49861 

3.0   49865 
3.5   4997674 
4.0   4999683 
4.5   4999966 
5.0   4999997133 


1 This table has been adapted, by permission, from F. C. Kent, “Elements of Statistics,” 
McGraw-Hill Book Company, Inc., 1924. 
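
For values of x/σ that fall between or beyond the tabled entries, the same areas can be computed directly. The following is a minimal sketch using the error function from the Python standard library; the function name `area_from_mean` is merely illustrative and is not part of the original table.

```python
import math

def area_from_mean(z: float) -> float:
    """Area under the unit normal curve between the mean and z standard deviations."""
    # The normal integral from 0 to z equals one half of erf(z / sqrt(2)).
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

for z in (0.50, 1.00, 1.23, 1.96, 3.00):
    print(f"x/sigma = {z:4.2f}   area = {area_from_mean(z):.5f}")
```

For x/σ = 1.23 the computed area rounds to the tabled entry 39065.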

APPENDIX II 

Ordinates of the Normal Curve 

Ordinates (heights) of the unit normal curve.¹ The height (y) at any number of standard 
deviations (x) from the mean is 

$y = 0.3989e^{-x^2/2}$ 

To obtain answers in units of particular problems, multiply these ordinates by N(Ci)/σ, 
where N is the number of cases, Ci the class interval, and σ the standard deviation. 

Each figure in the body of the table is preceded by a decimal point. 


x/σ    .00     .01     .02     .03     .04     .05     .06     .07     .08     .09 

0.0   39894   39892   39886   39876   39862   39844   39822   39797   39767   39733 
0.1   39695   39654   39608   39559   39505   39448   39387   39322   39253   39181 
0.2   39104   39024   38940   38853   38762   38667   38568   38466   38361   38251 
0.3   38139   38023   37903   37780   37654   37524   37391   37255   37115   36973 
0.4   36827   36678   36526   36371   36213   36053   35889   35723   35553   35381 
0.5   35207   35029   34849   34667   34482   34294   34105   33912   33718   33521 
0.6   33322   33121   32918   32713   32506   32297   32086   31874   31659   31443 
0.7   31225   31006   30785   30563   30339   30114   29887   29659   29430   29200 
0.8   28969   28737   28504   28269   28034   27798   27562   27324   27086   26848 
0.9   26609   26369   26129   25888   25647   25406   25164   24923   24681   24439 
1.0   24197   23955   23713   23471   23230   22988   22747   22506   22265   22025 
1.1   21785   21546   21307   21069   20831   20594   20357   20121   19886   19652 
1.2   19419   19186   18954   18724   18494   18265   18037   17810   17585   17360 
1.3   17137   16915   16694   16474   16256   16038   15822   15608   15395   15183 
1.4   14973   14764   14556   14350   14146   13943   13742   13542   13344   13147 
1.5   12952   12758   12566   12376   12188   12001   11816   11632   11450   11270 
1.6   11092   10915   10741   10567   10396   10226   10059   09893   09728   09566 
1.7   09405   09246   09089   08933   08780   08628   08478   08329   08183   08038 
1.8   07895   07754   07614   07477   07341   07206   07074   06943   06814   06687 
1.9   06562   06438   06316   06195   06077   05959   05844   05730   05618   05508 
2.0   05399   05292   05186   05082   04980   04879   04780   04682   04586   04491 
2.1   04398   04307   04217   04128   04041   03955   03871   03788   03706   03626 
2.2   03547   03470   03394   03319   03246   03174   03103   03034   02965   02898 
2.3   02833   02768   02705   02643   02582   02522   02463   02406   02349   02294 
2.4   02239   02186   02134   02083   02033   01984   01936   01888   01842   01797 
2.5   01753   01709   01667   01625   01585   01545   01506   01468   01431   01394 
2.6   01358   01323   01289   01256   01223   01191   01160   01130   01100   01071 
2.7   01042   01014   00987   00961   00935   00909   00885   00861   00837   00814 
2.8   00792   00770   00748   00727   00707   00687   00668   00649   00631   00613 
2.9   00595   00578   00562   00545   00530   00514   00499   00485   00470   00457 

3.0   00443 
3.5   0008727 
4.0   0001338 
4.5   0000160 
5.0   000001487 


1 This table adapted, by permission, from Kent, “Elements of Statistics.” 
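
The ordinates of this appendix, and the scaling rule stated at its head, can also be computed directly. The sketch below is illustrative only: the function names are hypothetical, and the number of cases, class interval, and standard deviation used in the example are invented for the purpose. The constant 0.3989 of the formula is written out as 1/sqrt(2π).

```python
import math

def ordinate(x: float) -> float:
    """Height of the unit normal curve at x standard deviations from the mean."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def fitted_frequency(x: float, n_cases: float, class_interval: float, sigma: float) -> float:
    """Unit ordinate scaled to the units of a problem: N * Ci / sigma * y."""
    return n_cases * class_interval / sigma * ordinate(x)

# Hypothetical problem: 200 cases, class interval of 5, standard deviation of 10.
n_cases, class_interval, sigma = 200, 5.0, 10.0

for x in (0.0, 0.5, 1.0, 2.0):
    y = ordinate(x)
    f = fitted_frequency(x, n_cases, class_interval, sigma)
    print(f"x/sigma = {x:3.1f}   unit ordinate = {y:.5f}   fitted frequency = {f:6.2f}")
```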

APPENDIX III 

5 and 1 Per Cent Significance Points of F* 

5 per cent points are in roman type; 1 per cent points are in boldface type. 

* Reproduced through the courtesy of the author and the publisher from G. W. Snedecor, Statistical Methods, Collegiate Press, Inc., of Iowa State College, 
Ames, Iowa, 1946, Table 10.7. 




[The body of the table of 5 and 1 per cent points of F is illegible in this copy.] 



APPENDIX IV 


Values of r for Various Values of z from 0 to 3* 

Each figure in the body of the table is preceded by a decimal point. 

Example: When z has a value of 1.23, r has a value of 0.8426 


z      .00     .01     .02     .03     .04     .05     .06     .07     .08     .09 

0.0    0000    0100    0200    0300    0400    0500    0599    0699    0798    0898 
0.1    0997    1096    1194    1293    1391    1489    1586    1684    1781    1877 
0.2    1974    2070    2165    2260    2355    2449    2543    2636    2729    2821 
0.3    2913    3004    3095    3185    3275    3364    3452    3540    3627    3714 
0.4    3800    3885    3969    4053    4136    4219    4301    4382    4462    4542 
0.5    4621    4699    4777    4854    4930    5005    5080    5154    5227    5299 
0.6    5370    5441    5511    5580    5649    5717    5784    5850    5915    5980 
0.7    6044    6107    6169    6231    6291    6351    6411    6469    6528    6584 
0.8    6640    6696    6751    6805    6858    6911    6963    7014    7064    7114 
0.9    7163    7211    7259    7306    7352    7398    7443    7487    7531    7574 
1.0    7616    7658    7699    7739    7779    7818    7857    7895    7932    7969 
1.1    8006    8041    8076    8110    8144    8178    8210    8243    8275    8306 
1.2    8337    8367    8397    8426    8455    8483    8511    8538    8565    8591 
1.3    8617    8643    8668    8692    8717    8741    8764    8787    8810    8832 
1.4    8854    8875    8896    8917    8937    8957    8977    8996    9015    9033 
1.5    9051    9069    9087    9104    9121    9138    9154    9170    9186    9201 
1.6    9217    9232    9246    9261    9275    9289    9302    9316    9329    9341 
1.7    9354    9366    9379    9391    9402    9414    9425    9436    9447    9458 
1.8   94681   94783   94884   94983   95080   95175   95268   95359   95449   95537 
1.9   95624   95709   95792   95873   95953   96032   96109   96185   96259   96331 
2.0   96404   96473   96541   96609   96675   96739   96803   96865   96926   96986 
2.1   97045   97103   97159   97215   97269   97323   97375   97426   97477   97526 
2.2   97574   97622   97668   97714   97759   97803   97846   97888   97929   97970 
2.3   98010   98049   98087   98124   98161   98197   98233   98267   98301   98335 
2.4   98367   98399   98431   98462   98492   98522   98551   98579   98607   98635 
2.5   98661   98688   98714   98739   98764   98788   98812   98835   98858   98881 
2.6   98903   98924   98945   98966   98987   99007   99026   99045   99064   99083 
2.7   99101   99118   99136   99153   99170   99186   99202   99218   99233   99248 
2.8   99263   99278   99292   99306   99320   99333   99346   99359   99372   99384 
2.9   99396   99408   99420   99431   99443   99454   99464   99475   99485   99495 

3.0   99505 
3.5   998178 
4.0   999329 
4.5   999753 
5.0   999909 
5.5   9999666 
6.0   9999877 
6.5   99999548 


For greater accuracy, or for values beyond the table, 

$r = \dfrac{e^{2z} - 1}{e^{2z} + 1} \qquad z = 1.15129254 \log_{10} \dfrac{1 + r}{1 - r}$ 

* Taken by permission, with minor additions, from R. A. Fisher, “Statistical Methods for 
Research Workers,” Oliver & Boyd, Edinburgh, 1938. 
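
The two conversion formulas of this appendix lend themselves to direct computation when a value lies between or beyond the tabled entries. The following is a minimal sketch; the function names `r_from_z` and `z_from_r` are illustrative, and the constant 1.15129254 is taken as half the natural logarithm of 10, which is what turns the common logarithm into the required natural one.

```python
import math

def r_from_z(z: float) -> float:
    """Correlation coefficient corresponding to z: (e^(2z) - 1) / (e^(2z) + 1)."""
    e2z = math.exp(2.0 * z)
    return (e2z - 1.0) / (e2z + 1.0)

def z_from_r(r: float) -> float:
    """Value of z corresponding to r: 1.15129254 * log10((1 + r) / (1 - r))."""
    return 1.15129254 * math.log10((1.0 + r) / (1.0 - r))

# The worked example at the head of the table: z = 1.23.
r = r_from_z(1.23)
print(f"z = 1.23  ->  r = {r:.4f}")
print(f"r = {r:.4f}  ->  z = {z_from_r(r):.4f}")
```

The first formula is the hyperbolic tangent of z, so `math.tanh(z)` gives the same result.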

Alpha sub three (α₃), and computa¬ 
as measure of skewness, 204 
Alphas (a), definition of, 192 
in normal distributions, 192 
of probability data, 197 
standard error of, 245 
Amplitude of cycle, 396 
Analysis of variance, 264-285 
computation of, 275 
degrees of freedom in, 281 
purpose of, 269 
results interpreted, 283 
summary of computations, 284-285 

Approximate numbers, addition and 
subtraction of, 18 
computation with, 15 
division of, 16 
multiplication of, 16 
a priori probability, 156 
Areas under normal curve, table of, 
514 

Arithmetic average (see Mean) 
Array, 66 
Attributes, 56 
Average deviation, 130 
standard error of, 248 


Averages, 61-125 
advantages and disadvantages of 
various, 109-123 
arithmetic (see Mean) 
choice of, summarized, 123 
desirable characteristics of, 106 
with grouped data, 82-125 

computation of, summarized, 
104-106 

moving (see Moving averages) 
of position, 75 

computed from ogive, 97 
relationships between, 73, 108, 201 
with ungrouped data, summarized, 79 

B 

Base period of index numbers, 422, 434 
Beta coefficients, 506 
Betas (β), definition of, 207 
standard error of, 249 
Bias in index numbers, 423, 428 
Biassed errors, 8 
Binomial expansion, 160-163 
Bowley's measure of skewness, 202 
for standard deviation, 143 
Chi-square test, 221 
interpreting results of, 226 
charts for, 225, 228 
rules for applying, summarized, 
Class interval, 33 
choice of, 48 

unequal, when to use, 49 
Class limits, 28 
overlapping, 32 
Class mark, 33 

locating in frequency table, 54 
Class mid-point, 33 
Coefficient, of alienation, 465 
of correlation, 447 
computation of, 454 
formulas for, 455 
interpretation of, 462 
standard error of, 451, 472 
of determination, 469 
of regression (see Regression coefficients) 

of variation (or variability), 152 
standard error of, 249 
Combining statistics of subgroups, 
264-268 

formulas for, 268 
Compensating errors, 8 
Compound interest law (C.I.L.) (see 
Semilogarithmic curve) 
Computation with approximate 
numbers, 15 

Computations, checking accuracy of 
(see Charlier check) 

Confidence interval, 258 
Continuous data, 56 
and mode, 69 
Corrected prices, 431 
Correlation, 442-510 

coefficient of (see Coefficient) 
curvilinear, 481 
degrees of freedom with, 484 
standard errors in, 487 
joint, 501 
multiple, 493-510 
computation of, 502 
curvilinear and linear, 500 
regression equations, 495, 505 
simple linear, 442-492 
coefficient of (r), 447 
computation of, 457 
corrected, 450 


Correlation, simple linear, dependence on normality of 
distributions, 472 
formulas for, 455 
of grouped data, 474 
standard error of, 451, 472 
z-transformation with, 452 
Correlation tables, 475 
Cumulative frequency tables, 35 
Curve fitting, 340-382 
differencing as aid in, 372-375 
distorted axes in, 376-380 
distorting data in, 375-376 
selection of curve types for, 369-382 

summary of methods, 380-381 
Curves, common types of, 342 
formulas for, 343 

frequency (see Frequency curves) 
Gaussian, 164 

humpbacked (mound-shaped), 41 
Laplacian, 164 

logarithmic (see Logarithmic 
curves) 

Poisson, 215, 218 
reciprocal (see Reciprocal curves) 
Curvilinear relationships, 297 
Curvilinearity, discovery of, 340 
Cyclical movements, 387 
amplitude of, 396 
nature of, 394 
period of, 395-399 

D 

Data, continuous, 56, 69 
discrete, 56, 69, 96, 114 
distorting, in curve fitting, 375-376 

diurnal movements in, 387, 397 
grouped, arithmetic mean with, 84 
averages with, 82-125 
computation of, 105-106 
deciles and quartiles with, 103 
geometric mean with, 99 
harmonic mean with, 100 
median with, 93 
mode with, 98 
Data, grouped, percentiles with, 104 
quadratic mean with, 104 
simple linear correlation of, 474 
standard deviation with, 139 
heterograde and homograde, 56 
historical (see Historical data) 
nonmathematical, median with, 
114 

probability, arithmetic mean of, 
157, 197 

selection of, for index numbers, 
439 

ungrouped, arithmetic mean with, 
61 

averages with, 79 
deciles with, 76 
geometric mean with, 70 
harmonic mean with, 71 
median with, 66 
mode with, 68 
percentiles with, 104 
quadratic mean with, 74 
quartiles with, 75 
standard deviation with, 136 
Deciles, definition of, 76 
with grouped data, 103 
with ungrouped data, 76 
Degrees of freedom, 271-275 
in analysis of variance, 281 
concept explained, 271 
with curvilinear correlation, 484 
and frequency distributions, 224 
with regression lines, 450 
Dependence of variables, 328-333, 
494 

Dependent events, 159 
Dependent variable, 298, 329, 494 
Determination, coefficient of, 469 
Deviation, average, 130, 248 
standard (see Standard deviation) 
Differences, first, 372 
second and third, 373 
significance of, 253 
standard error of, 250 
Differencing as aid in curve fitting, 
372-375 

Direct relationship, 294 
Discrete data, 56 


Discrete data, and median, 96, 114 
and mode, 69 
Dispersion, 127 
measures of, 126-154 
relative, 149 

sizes of measures in normal distributions, 145 

Distorted axes in curve fitting, 376-380 

Distorting data in curve fitting, 375-376 

Diurnal movements in data, 387, 397 
Division of approximate numbers, 16 

E 

Empirical probability, 157 
Erratic movements, 387 
Error, accidental, 9 
biassed and compensating, 8 
constant and cumulative, 9 
grouping, of arithmetic mean, 92 
corrections for, 195 
of moments, 194 

persistent, random, systematic, 
and unbiassed, 9 
probable, 242 

standard (see Standard error) 
Estimate, errors of, 309, 320, 442 
Excess (see Kurtosis) 

Expansion, binomial, 160-163 
Experimental method, 1 
Exponential curve (see Semilogarithmic curve) 

Extrapolation, 325 
of second-degree parabola, 352 

F 

F, 281 

interpretation of, 283 
tables of, 516-519 
Fiducial interval, 261 
Fiducial probability, 258 
First differences, 372 
5 per cent point, 256 
of F, 282 

tables of, 516-519 

Freedom, degrees of (see Degrees of 
freedom) 
Freehand trends, linear, 303 
of second-degree parabola, 345 
Frequency curves, 38 
common shapes of, 41 
goodness of fit of, 221 
Pearsonian, 211, 212 
Poisson, 215 

Frequency distributions, 26-60 
combining measures from, 264 
degrees of freedom and, 224 
J-shaped, 41, 217 
moments of (see Moments of frequency distributions) 
U-shaped, kurtosis in, 207 
Frequency histograms, 36 
Frequency polygons, 37 
Frequency tables, 26ff. 

choosing class interval in, 48 
cumulative, 35 
locating class marks in, 54 
logarithmic, 53 

number of classes in, 44, 47, 146 
rules for making, 47-57 
summarized, 57 

with unequal class intervals, 49-54 
G 

Gaussian curve, 164 
Geometric mean, 70 

advantages and disadvantages of, 
116 

with geometric progressions, 117 
with grouped data, 99 
with positive skewness, 120 
with ungrouped data, 70 
Geometric progression and geometric mean, 117 

Goodness of fit of frequency curves, 

221 

Grouping error, with arithmetic 
mean, 92 

corrections for, 195 
with moments, 194 

H 

Half-life, concept of, 359-362 
formulas for, 362 


Harmonic mean, 71 
advantages and disadvantages of, 
120 

with grouped data, 100 
with ungrouped data, 71 
Heterograde data, 56 
Histograms, 36 
Historical data, 386-418 
residual movements in, 387 
trend lines in, 322 
types of movements in, 386 
Homograde data, 56 
Humpbacked curves, 41 

I 

“Ideal” index numbers, 438 
Independent events, 159 
Independent variables, 298, 329, 494 
Index of seasonal variation, 402-411 
Index numbers, 420-439 
aggregative, 421 
averages of relatives, 422 
base period for, 422, 434 
bias in, 423 
weight, 428 

choosing formula for, 437 
correcting prices with, 431 
“ideal,” 438 

selection of data for, 439 
uses of, 429 
weighting of, 424 
Inverse relationship, 295 

J 

J-shaped distributions, 41, 217 
Joint correlation, 501 

K 

Kelley’s measure of skewness, 203 
Kurtosis, 206 

in rectangular distributions, 207, 
208 

in U-shaped distributions, 207 

L 

Laplacian curves, 164 
Laws, discovery of, 381 
Least squares, and arithmetic mean, 
110 

with logarithmic curves, 365 
with reciprocal curves, 367 
with second-degree parabolas, 348 
with semilogarithmic curves, 358 
with straight lines, 307, 311 
with third-degree parabolas, 353 
Leptokurtosis, 206 
Less-than frequencies, 36 
Linear relationship, 297 
Link relatives, 408, 435 
Logarithmic curves, 362-366 
formulas for, 362-363 
interpretation of, 363 
least-squares method, 365 
normal equations for, 365 
selected-points method, 364 
Logarithmic frequency classes, 53 
Logarithmic paper, 378 

M 

Maximum point of second-degree 
parabola, 352 
Mean, arithmetic, 61 

advantages and disadvantages 
of, 109 

checking accuracy of computation, 91 

combining from subgroups, 111, 
261 

with grouped data, 84 
grouping error with, 92 
and least squares, 110 
of probability data, 157, 197 
short method for, 87 
standard error of, 236 
with ungrouped data, 61 
weighted, 63 
progressive, 392 
Mean square, 274 
Measure of skewness, α₃, 204 
Bowley’s, 202 
Kelley’s, 203 
Pearson’s, 201 

Measurement, accuracy of, 7 


Measures, of dispersion, 126-154 
of reliability, 233-261 
(See also Standard error) 
sizes of, in normal distribution, 
145 

of skewness, 200-203 
Median, 66 

advantages and disadvantages of, 
113 

with discrete data, 96, 114 
with grouped data, 93 
with nonmathematical data, 114 
from ogives, 97 
with open-end classes, 114 
standard error of, 245 
with ungrouped data, 66 
Mesokurtosis, 206 

Minimum point of second-degree 
parabola, 351 
Mode, 68 

advantages and disadvantages of, 
115 

computed from α₃, 205 
with discrete and continuous data, 
69 

with grouped data, 98 
with ungrouped data, 68 
Moments of frequency distributions, 
187 

adjusted, 195 
Charlier check for, 193 
computation of, 189 
crude, 195 

of probability distributions, 196 
More-than frequencies, 36 
Mound-shaped (humpbacked) 
curves, 41 

Movements, cyclical (see Cyclical 
movements) 

diurnal, in data, 387, 397 
random, 387, 413-416 
Moving averages, 389 
advantages and disadvantages of, 
393 

period of, 391 
and seasonal variation, 399 
Multiple correlation (see Correlation, multiple) 
Multiplication, with approximate 
numbers, 16 
of probabilities, 159 
Mutually exclusive events, 159 

N 

Nature of cyclical movements, 394 
of relationship, 291, 446 
of statistics, 1-6 
Negative relationships, 295 
Normal distributions, 163 
areas under, 167, 514 
dependence of correlation coeffi¬ 
cient on, 472 
described, 165 
fitted, by areas, 181 
by ordinates, 177 
formula for, 166 
ordinates of, 515 
tests for, 172 
unit, 177 

Normal equations, for logarithmic 
curves, 365 

for reciprocal curves, 367 
for second-degree parabolas, 348 
for semilogarithmic curves, 358 
for straight lines, 311-312 
when origin is at center, 324 
for third-degree parabolas, 353 
Normality, concept of, 416 
Notation, scientific, 14 

O 

Observation equations, 305 
Ogive, 39 

and averages of position, 97 
common shapes of, 43 
median computed from, 97 
test for normality of distribution, 
172 

1 per cent point, 256 
of F, 282 

tables of, 516-519 
Open-end classes, 32 
and median, 114 
Ordinates of normal curve, 515 


Origin of time series, 322 
Overlapping class limits, 32 

P 

Parabolas, 344 
second-degree, 344 
extrapolation of, 352 
freehand trends of, 345 
least-squares method for, 348 
maximum and minimum points of, 351-352 

normal equations for, 348 
selected points, 345 
third-degree, 352 
normal equations for, 353 
Parameters, 241, 342 
Part and partial correlation, 508 
Pascal's triangle, 162 
Pearsonian coefficient of correlation, 
447 

Pearsonian frequency curves, 211 
fitting Type III, 212 
Pearson’s measure of skewness, 201 
Percentages, standard error of, 247 
Percentiles, 77 

with grouped data, 104 
with ungrouped data, 104 
Period, of cycles, 395-399 
of moving average, 391 
Platykurtosis, 206 
Point binomial, 160, 164 
skewness of, 198 
Poisson curves, 215 
formula for, 218 
Polygon, frequency, 37 
Position, averages of, 75, 97 
Positive relationship, 294 
Price relatives, 423 
Probable error, 242 
Probability, 155 
curve of, 164 

elementary theorems of, 159 
empirical, 157 
a priori, 156
statistical, 157 
Probability paper, 172 
data for constructing, 175 
Progressive mean, 392 
Q

Quadratic mean, 74 
advantages and disadvantages of, 
122 

with grouped data, 101 
with ungrouped data, 74 
Quartiles, 75
with grouped data, 103 
standard error of, 248 
with ungrouped data, 75 

R 

r, coefficient of correlation, 447 
in relation to z, 520 
Random movements, 387, 413-416 
Range, 127 

Reciprocal curves, 366-369 
formula for, 366 
least-squares method for, 367 
normal equations for, 367 
selected points, 367 
Reciprocal paper, 378 
Rectangular distributions, 43 
kurtosis of, 207, 208 
Regression, concept of, 331-332, 471 
Regression coefficients, 319, 330 
in multiple correlation, 497 
and r, 455 

Regression equations, computation 
of, 457 

in multiple correlation, 495 
Regression lines, 301 
degrees of freedom with, 450 
freehand linear, 303 
interpretation of results, 316 
Relationship, curvilinear, 297 
direct, 294 
inverse, 295 
linear, 297 

methods of finding, 294 
nature of, 291, 446 
negative, 295 
positive, 294 
Relative dispersion, 149 
Relative frequency, standard error of, 247


Relatives, link, 408, 435 
price, 423 

Reliability, measures of, 233-261 
(See also Standard error) 

Residual movements in historical 
data, 387 
Residuals, 320 

Root-mean-square deviation (see Standard deviation)

Rounding off numbers, rules for, 22 

S 

Sample, 233 
Scatter diagram, 297 
Scattergram, 297 
Scientific laws, discovery of, 381 
Scientific method, 1 
Scientific notation, 14 
Seasonal movements, 387, 397 
elimination of, 411 
Seasonal variation, index of, 402-411

and link relatives, 408 
and moving average, 399 
Second-degree parabola (see Parabola)

Second differences, 373 
Secular trend, 386, 389-394 
Selected points, method of, 304 
with logarithmic curves, 364 
with reciprocal curves, 367 
with second-degree parabolas, 
345 

with semilogarithmic curves, 
356 

with straight lines, 304 
with third-degree parabolas, 353 
Semi-interquartile range, 128 
standard error of, 248 
Semilogarithmic curves, 354-362 
fitting of, 356 
formula for, 354 
and half-life, 359 
interpretation of, 355, 358-359, 
363 

least-squares methods, 358 
normal equations for, 358 
selected points, 356 
Semilogarithmic paper, 377
Sheppard’s corrections, 195 
Sigma (see Standard deviation)
Significant figures, 10 
Skewed curves, 41 
Skewness, and alpha sub three (α₃), 204

and divergence of averages, 203 
and geometric mean, 120 
measures of, 200-203 
and Poisson distribution, 217 
standard error of, 249 
Standard (root-mean-square) deviation, 135

Charlier’s check for, 143 
with grouped data, 139 
meaning of, 144 
of probability data, 157, 197 
short method for, 141 
standard error of, 244 
with ungrouped data, 336 
Standard error, of alphas (α), 245
of arithmetic mean, 236 
of average deviation, 248 
of beta sub two (β₂), 294
of coefficient, of correlation, 447 
of variation, 249 
of curvilinear correlation, 487 
of differences, 250 
of estimate, 445 

computation of, 457 
corrected, 450 
in multiple correlation, 508 
of median, 245 
of percentages, 247 
of quartiles, 248 
of r, 451 

of relative frequency, 247 
of semi-interquartile range, 248 
of simple linear correlation, 451, 
472 

of skewness, 249 
of standard deviation, 244 
of sum, 253 
of z, 453

Standard notation, 13 
Standard units, 170, 332n. 

Statistic, 241 


Statistical method, 2-4 
Statistical probability, 157 
Statistics, nature of, 1-6 
Straight lines, fitting of, 289-337 
Sturges’ rule, 47 

Subtraction of approximate numbers, 18

Sum, of squares, 274, 279-280 
standard error of, 253 
Systematic error, 9 

T 

Theorems of probability, elementary, 
159 

Third-degree parabola (see Parabola) 
Third differences, 373 
Trend, elimination of, 333-336 
secular, 386 
Trend lines, 301 
freehand linear, 303 
short cuts with historical data, 322 
Triangle, Pascal’s, 162 
Type III frequency curve, fitted, 212

U 

U-shaped frequency distributions, 42 
kurtosis of, 207 
Unbiassed error, 9 
Unequal class intervals, 49 
and averages of position, 78 
with confidential data, 50 
how to use, 51 
Unit normal curve, 177 
Units, standard (see Standard units)
Universe, 233 

V 

Variability, 126 
coefficient of, 152 
Variables, 56 

dependence of, 328-333, 494 
Variance, definition of, 147 
analysis of (see Analysis of variance)

combining subgroups, 148, 264 
computations of, 281 
and degrees of freedom, 224 
division of, 266 
Variance, between and within groups, 206

Variation, coefficient of, 52 
W 

Weekly cycles, 398 

Weight bias in index numbers, 428 


Weighted arithmetic mean, 63, 111 
Weighting of index numbers, 424 

Z 

z-transformation, 452, 473
in relation to r, 520 